The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, producing 3D videos remains costly and complex, while automatic monocular-to-stereo conversion is hindered by the limitations of the multi-stage “Depth-Warp-Inpaint” (DWI) pipeline. This paradigm suffers from error propagation, depth ambiguity, and format inconsistency between parallel and converged stereo configurations. To address these challenges, we introduce UniStereo, the first large-scale unified dataset for stereo video conversion, covering both stereo formats to enable fair benchmarking and robust model training. Building upon this dataset, we propose StereoPilot, an efficient feed-forward model that directly synthesizes the target view without relying on explicit depth maps or iterative diffusion sampling. Equipped with a learnable domain switcher and a cycle-consistency loss, StereoPilot adapts seamlessly to different stereo formats and achieves improved consistency. Extensive experiments demonstrate that StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency. All data and code will be made publicly available.
The training framework of the proposed StereoPilot. StereoPilot uses a single-step feed-forward architecture (Diffusion as Feed-Forward) that incorporates a learnable domain switcher s to unify conversion for both parallel and converged stereo formats. The entire model is optimized using a cycle-consistent training strategy, combining reconstruction and cycle-consistency losses to ensure high fidelity and precise geometric alignment.
The blue and orange lines represent the Left-to-Right and Right-to-Left reconstruction processes, respectively, and the orange dashed line denotes the L → R → L cycle-consistency path.
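As a concrete illustration, a minimal PyTorch-style sketch of this objective follows. The `model(view, s, direction)` interface, the `direction` flag, and the use of L1 losses are assumptions made for this sketch, not the exact StereoPilot implementation.

```python
import torch.nn.functional as F

def stereo_losses(model, left, right, s):
    """Sketch of the reconstruction + cycle-consistency objective.

    model(view, s, direction) is a hypothetical interface that synthesizes the
    opposite-eye view, with s the domain-switcher flag (parallel vs. converged).
    """
    # Reconstruction in both directions (the blue and orange solid paths).
    pred_right = model(left, s, direction="L2R")
    pred_left = model(right, s, direction="R2L")
    loss_rec = F.l1_loss(pred_right, right) + F.l1_loss(pred_left, left)

    # Cycle consistency: L -> R -> L should recover the original left view
    # (the orange dashed path).
    cycle_left = model(pred_right, s, direction="R2L")
    loss_cyc = F.l1_loss(cycle_left, left)

    return loss_rec + loss_cyc
```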
In the parallel setup, when both eyes observe the same subject, the projected image points on the left and right views are denoted XL and XR, and their absolute difference defines the disparity s = |XL − XR|. By similar triangles, the baseline b, focal length f, depth d, and disparity s satisfy s = b·f / d, so disparity is inversely proportional to depth when b and f are held constant. In the converged configuration, the two camera axes intersect at a zero-disparity projection plane: objects in front of this plane yield positive disparity, while those behind it produce negative disparity.
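These relations are easy to state in code. The sketch below is illustrative (function names are ours); the converged case uses the common shifted-sensor approximation, in which disparity is measured relative to the convergence plane.

```python
def parallel_disparity(depth, baseline, focal):
    # Parallel rig: s = b * f / d, so disparity falls off inversely with depth.
    return baseline * focal / depth

def converged_disparity(depth, baseline, focal, convergence_depth):
    # Converged rig: disparity is measured relative to the zero-disparity
    # plane at convergence_depth. Points in front of the plane
    # (depth < convergence_depth) get positive disparity; points behind it
    # get negative disparity.
    return baseline * focal * (1.0 / depth - 1.0 / convergence_depth)
```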
Green icons with numbered steps depict the Stereo4D pipeline: starting from the raw VR180 videos, we set hfov = 90° and specify the projection resolution to produce the final left- and right-eye monocular videos. Blue icons with numbered steps depict the 3DMovie pipeline: we segment the source films into clips, filter out non-informative segments, split side-by-side (SBS) frames into left/right monocular views, and remove black borders. All resulting videos are captioned using ShareGPT4Video.
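For illustration, a minimal sketch of the SBS-splitting and black-border-removal steps is given below; the function names and border threshold are our assumptions, not the exact 3DMovie pipeline.

```python
import numpy as np

def split_sbs(frame):
    """Split an H x (2W) x 3 side-by-side stereo frame into left/right views."""
    w = frame.shape[1] // 2
    return frame[:, :w], frame[:, w:]

def crop_black_borders(frame, thresh=10):
    """Remove near-black rows/columns (e.g., letterbox bars) around a frame."""
    gray = frame.mean(axis=2)
    rows = np.where(gray.max(axis=1) > thresh)[0]
    cols = np.where(gray.max(axis=0) > thresh)[0]
    return frame[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```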
Qualitative Results. Our method produces more accurate disparity and preserves finer visual details on both parallel and converged data than existing baselines.
Quantitative comparison. The first five metric columns are evaluated on the Stereo4D (parallel) format and the next five on the 3D Movie (converged) format; "--" marks unavailable or inapplicable results.

| Method | Venue | SSIM↑ | MS-SSIM↑ | PSNR↑ | LPIPS↓ | SIOU↑ | SSIM↑ | MS-SSIM↑ | PSNR↑ | LPIPS↓ | SIOU↑ | Latency↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| StereoDiffusion | CVPR'24 | 0.642 | 0.711 | 20.541 | 0.245 | 0.252 | 0.678 | 0.612 | 20.695 | 0.341 | 0.181 | 60 min |
| StereoCrafter | arXiv'24 | 0.553 | 0.562 | 17.673 | 0.298 | 0.226 | 0.706 | 0.799 | 23.794 | 0.203 | 0.213 | 1 min |
| SVG | ICLR'25 | 0.561 | 0.543 | 17.971 | 0.368 | 0.220 | 0.653 | 0.553 | 19.059 | 0.426 | 0.166 | 70 min |
| ReCamMaster | ICCV'25 | 0.542 | 0.525 | 17.229 | 0.312 | 0.239 | -- | -- | -- | -- | -- | 15 min |
| M2SVid | 3DV'26 | -- | 0.915 | 26.200 | 0.180 | -- | -- | -- | -- | -- | -- | -- |
| Mono2Stereo | CVPR'25 | 0.649 | 0.721 | 20.894 | 0.222 | 0.241 | 0.795 | 0.810 | 25.756 | 0.191 | 0.201 | 15 min |
| StereoPilot (Ours) | -- | 0.861 | 0.937 | 27.735 | 0.087 | 0.408 | 0.837 | 0.872 | 27.856 | 0.122 | 0.260 | 11 s |
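For reference, per-frame SSIM and PSNR scores like those above are typically computed with off-the-shelf implementations; the following scikit-image sketch illustrates this (MS-SSIM, LPIPS, and SIOU require their own implementations and are omitted here).

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_metrics(pred, gt):
    """Per-frame SSIM and PSNR between a synthesized view and ground truth.

    pred, gt: H x W x 3 uint8 frames; video-level scores average over frames.
    """
    ssim = structural_similarity(pred, gt, channel_axis=-1)
    psnr = peak_signal_noise_ratio(gt, pred)
    return ssim, psnr
```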
Qualitative video comparisons, available in Anaglyph and Side-by-Side (SBS) formats.
Generalization on Native 2D Movies (No Ground Truth). Results are available in Anaglyph and Side-by-Side (SBS) formats; enable sound for the best experience.
@misc{shen2025stereopilot,
title={StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors},
author={Shen, Guibao and Du, Yihua and Ge, Wenhang and He, Jing and Chang, Chirui and Zhou, Donghao and Yang, Zhen and Wang, Luozhou and Tao, Xin and Chen, Ying-Cong},
year={2025},
eprint={2512.16915},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.16915},
}