The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, producing 3D videos remains costly and complex, while automatic monocular-to-stereo conversion is hindered by the limitations of the multi-stage “Depth-Warp-Inpaint” (DWI) pipeline. This paradigm suffers from error propagation, depth ambiguity, and format inconsistency between parallel and converged stereo configurations. To address these challenges, we introduce UniStereo, the first large-scale unified dataset for stereo video conversion, covering both stereo formats to enable fair benchmarking and robust model training. Building upon this dataset, we propose StereoPilot, an efficient feed-forward model that directly synthesizes the target view without relying on explicit depth maps or iterative diffusion sampling. Equipped with a learnable domain switcher and a cycle-consistency loss, StereoPilot adapts seamlessly to different stereo formats and achieves improved consistency. Extensive experiments demonstrate that StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency. All data and code will be made publicly available.
The training framework of the proposed StereoPilot. StereoPilot uses a single-step feed-forward architecture (Diffusion as Feed-Forward) that incorporates a learnable domain switcher s to unify conversion for both parallel and converged stereo formats. The entire model is optimized using a cycle-consistent training strategy, combining reconstruction and cycle-consistency losses to ensure high fidelity and precise geometric alignment.
The blue and orange lines represent the Left-to-Right and Right-to-Left reconstruction processes, respectively, and the orange dashed line denotes the L → R → L cycle-consistency path.
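As a concrete illustration, a minimal PyTorch-style sketch of this objective follows. The `model(view, s, direction)` interface, the `direction` flag, and the use of L1 losses are assumptions made for this sketch, not the exact StereoPilot implementation.

```python
import torch.nn.functional as F

def stereo_losses(model, left, right, s):
    """Sketch of the reconstruction + cycle-consistency objective.

    model(view, s, direction) is a hypothetical interface that synthesizes the
    opposite-eye view, with s the domain-switcher flag (parallel vs. converged).
    """
    # Reconstruction in both directions (the blue and orange solid paths).
    pred_right = model(left, s, direction="L2R")
    pred_left = model(right, s, direction="R2L")
    loss_rec = F.l1_loss(pred_right, right) + F.l1_loss(pred_left, left)

    # Cycle consistency: L -> R -> L should recover the original left view
    # (the orange dashed path).
    cycle_left = model(pred_right, s, direction="R2L")
    loss_cyc = F.l1_loss(cycle_left, left)

    return loss_rec + loss_cyc
```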
In the parallel setup, when both eyes observe the same subject, the projected image points on the left and right views are denoted XL and XR, and their absolute difference defines the disparity s = |XL − XR|. By similar triangles, the baseline b, focal length f, depth d, and disparity s satisfy s = b·f / d, so disparity is inversely proportional to depth when b and f are held constant. In the converged configuration, the two camera axes intersect at a zero-disparity projection plane: objects in front of this plane yield positive disparity, while those behind it produce negative disparity.
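These relations are easy to state in code. The sketch below is illustrative (function names are ours); the converged case uses the common shifted-sensor approximation, in which disparity is measured relative to the convergence plane.

```python
def parallel_disparity(depth, baseline, focal):
    # Parallel rig: s = b * f / d, so disparity falls off inversely with depth.
    return baseline * focal / depth

def converged_disparity(depth, baseline, focal, convergence_depth):
    # Converged rig: disparity is measured relative to the zero-disparity
    # plane at convergence_depth. Points in front of the plane
    # (depth < convergence_depth) get positive disparity; points behind it
    # get negative disparity.
    return baseline * focal * (1.0 / depth - 1.0 / convergence_depth)
```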
Green icons with numbered steps depict the Stereo4D pipeline: starting from the raw VR180 videos, we set hfov = 90° and specify the projection resolution to produce the final left- and right-eye monocular videos. Blue icons with numbered steps depict the 3DMovie pipeline: we segment the source films into clips, filter out non-informative segments, split side-by-side (SBS) frames into left/right monocular views, and remove black borders. All resulting videos are captioned using ShareGPT4Video.
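For illustration, a minimal sketch of the SBS-splitting and black-border-removal steps is given below; the function names and border threshold are our assumptions, not the exact 3DMovie pipeline.

```python
import numpy as np

def split_sbs(frame):
    """Split an H x (2W) x 3 side-by-side stereo frame into left/right views."""
    w = frame.shape[1] // 2
    return frame[:, :w], frame[:, w:]

def crop_black_borders(frame, thresh=10):
    """Remove near-black rows/columns (e.g., letterbox bars) around a frame."""
    gray = frame.mean(axis=2)
    rows = np.where(gray.max(axis=1) > thresh)[0]
    cols = np.where(gray.max(axis=0) > thresh)[0]
    return frame[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```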
Qualitative Results. Our method produces more accurate disparity and preserves finer visual details on both parallel and converged data than existing baselines.
Quantitative comparison. The first five metric columns are evaluated on the Stereo4D (parallel) format and the next five on the 3D Movie (converged) format; "--" marks unavailable or inapplicable results.

| Method | Venue | SSIM↑ | MS-SSIM↑ | PSNR↑ | LPIPS↓ | SIOU↑ | SSIM↑ | MS-SSIM↑ | PSNR↑ | LPIPS↓ | SIOU↑ | Latency↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| StereoDiffusion | CVPR'24 | 0.642 | 0.711 | 20.541 | 0.245 | 0.252 | 0.678 | 0.612 | 20.695 | 0.341 | 0.181 | 60 min |
| StereoCrafter | arXiv'24 | 0.553 | 0.562 | 17.673 | 0.298 | 0.226 | 0.706 | 0.799 | 23.794 | 0.203 | 0.213 | 1 min |
| SVG | ICLR'25 | 0.561 | 0.543 | 17.971 | 0.368 | 0.220 | 0.653 | 0.553 | 19.059 | 0.426 | 0.166 | 70 min |
| ReCamMaster | ICCV'25 | 0.542 | 0.525 | 17.229 | 0.312 | 0.239 | -- | -- | -- | -- | -- | 15 min |
| M2SVid | 3DV'26 | -- | 0.915 | 26.200 | 0.180 | -- | -- | -- | -- | -- | -- | -- |
| Mono2Stereo | CVPR'25 | 0.649 | 0.721 | 20.894 | 0.222 | 0.241 | 0.795 | 0.810 | 25.756 | 0.191 | 0.201 | 15 min |
| StereoPilot (Ours) | -- | 0.861 | 0.937 | 27.735 | 0.087 | 0.408 | 0.837 | 0.872 | 27.856 | 0.122 | 0.260 | 11 s |
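For reference, per-frame SSIM and PSNR scores like those above are typically computed with off-the-shelf implementations; the following scikit-image sketch illustrates this (MS-SSIM, LPIPS, and SIOU require their own implementations and are omitted here).

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_metrics(pred, gt):
    """Per-frame SSIM and PSNR between a synthesized view and ground truth.

    pred, gt: H x W x 3 uint8 frames; video-level scores average over frames.
    """
    ssim = structural_similarity(pred, gt, channel_axis=-1)
    psnr = peak_signal_noise_ratio(gt, pred)
    return ssim, psnr
```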
Qualitative video comparisons, available in Anaglyph and Side-by-Side (SBS) formats.
Generalization on Native 2D Movies (No Ground Truth). Results are available in Anaglyph and Side-by-Side (SBS) formats; enable sound for the best experience.
@misc{shen2025stereopilot,
title={StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors},
author={Shen, Guibao and Du, Yihua and Ge, Wenhang and He, Jing and Chang, Chirui and Zhou, Donghao and Yang, Zhen and Wang, Luozhou and Tao, Xin and Chen, Ying-Cong},
year={2025},
eprint={2512.16915},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.16915},
}