StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors

1HKUST(GZ)   2HKUST   3Kling Team, Kuaishou Technology   4CUHK
*Equal contribution · This work was conducted during the author's internship at Kling · Corresponding author

Please enable sound for the best experience

Showcase Video

Introduction Video

Abstract

The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, producing 3D videos remains costly and complex, while automatic Monocular-to-Stereo conversion is hindered by the limitations of the multi-stage “Depth-Warp-Inpaint” (DWI) pipeline. This paradigm suffers from error propagation, depth ambiguity, and format inconsistency between parallel and converged stereo configurations. To address these challenges, we introduce UniStereo, the first large-scale unified dataset for stereo video conversion, covering both stereo formats to enable fair benchmarking and robust model training. Building upon this dataset, we propose StereoPilot, an efficient feed-forward model that directly synthesizes the target view without relying on explicit depth maps or iterative diffusion sampling. Equipped with a learnable domain switcher and a cycle consistency loss, StereoPilot adapts seamlessly to different stereo formats and achieves improved consistency. Extensive experiments demonstrate that StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency. All data and code will be made publicly available.

Contributions

  • We introduce UniStereo, the first large-scale, unified dataset for stereo video conversion, featuring both parallel and converged formats to enable fair benchmarking and model comparisons.
  • We propose StereoPilot, an efficient feed-forward architecture that leverages a pretrained video diffusion transformer to directly synthesize the novel view. It overcomes the limitations of “Depth-Warp-Inpaint” methods (error propagation, depth ambiguity, and format-specific assumptions) without iterative denoising overhead, while integrating a domain switcher and cycle consistency loss for robust multi-format processing.
  • Extensive experiments show StereoPilot significantly outperforms state-of-the-art methods on our UniStereo benchmark in both visual quality and efficiency.

Method & Pipeline

StereoPilot Pipeline

The training framework of the proposed StereoPilot. StereoPilot uses a single-step feed-forward architecture (Diffusion as Feed-Forward) that incorporates a learnable domain switcher s to unify conversion for both parallel and converged stereo formats. The entire model is optimized using a cycle-consistent training strategy, combining reconstruction and cycle-consistency losses to ensure high fidelity and precise geometric alignment.

The blue and orange lines represent the Left-to-Right and Right-to-Left reconstruction processes, respectively, and the orange dashed line denotes the L → R → L cycle-consistency path.
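To make the objective concrete, here is a minimal sketch of the cycle-consistent training step described above. The model call signature, the `direction` argument, and the L1 reconstruction loss are illustrative assumptions rather than the released interface:

```python
import torch.nn.functional as F

def training_step(model, left, right, switcher, lambda_cyc=1.0):
    """One cycle-consistent step: L->R and R->L reconstruction plus an L->R->L cycle.

    model       -- hypothetical feed-forward view synthesizer:
                   model(source_view, switcher, direction) -> predicted target view
    left, right -- (B, T, C, H, W) video clips (or latents) for the two eyes
    switcher    -- learnable domain embedding selecting parallel vs. converged format
    """
    # Left-to-right reconstruction (blue path in the figure)
    pred_right = model(left, switcher, direction="l2r")
    loss_rec = F.l1_loss(pred_right, right)

    # Right-to-left reconstruction (orange path)
    pred_left = model(right, switcher, direction="r2l")
    loss_rec = loss_rec + F.l1_loss(pred_left, left)

    # L -> R -> L cycle-consistency (orange dashed path)
    cycled_left = model(pred_right, switcher, direction="r2l")
    loss_cyc = F.l1_loss(cycled_left, left)

    return loss_rec + lambda_cyc * loss_cyc
```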

UniStereo Dataset & Data Preparation

Parallel vs. Converged Stereo

Parallel vs Converged Stereo

In the parallel setup, when both eyes observe the same subject, the projected image points on the left and right views are denoted as XL and XR, and their absolute difference s = |XL − XR| defines the disparity. By similar triangles, the baseline b, focal length f, depth d, and disparity s satisfy s = b·f / d, so disparity is inversely proportional to depth when b and f are held constant. In the converged configuration, a Zero-disparity Projection Plane exists: objects in front of this plane yield positive disparity, while those behind it produce negative disparity.
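To make the two conventions concrete, the sketch below computes disparities under simple pinhole assumptions; the variable names mirror the figure, and the converged formula uses a small-angle (shifted-sensor) approximation rather than necessarily the dataset's exact convention:

```python
def parallel_disparity(b: float, f: float, d: float) -> float:
    """Parallel rig: disparity s = |X_L - X_R| = b * f / d, inversely proportional to depth d."""
    return b * f / d

def converged_disparity(b: float, f: float, d: float, d_conv: float) -> float:
    """Converged rig with a zero-disparity plane at depth d_conv (small-angle approximation).
    Positive for objects in front of the plane (d < d_conv), negative for objects behind it."""
    return b * f * (1.0 / d - 1.0 / d_conv)

# Example: with fixed b and f, doubling the depth d halves the parallel disparity.
```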

Converged Stereo Examples

Avatar (Converged)
Pacific Rim (Converged)

Parallel Stereo Examples

Parallel Example 1
Parallel Example 2

UniStereo Construction Pipeline

Data Pipeline

In the figure, green icons with numbered steps depict the Stereo4D pipeline: starting from the raw VR180 videos, we set hfov = 90° and specify the projection resolution to produce the final left- and right-eye monocular videos. Blue icons with numbered steps denote the 3DMovie pipeline: we segment the source films into clips, filter out non-informative segments, convert the side-by-side (SBS) frames into separate left/right monocular views, and remove black borders. All resulting videos are captioned using ShareGPT4Video.
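As a rough illustration of the last two 3DMovie steps (SBS splitting and black-border removal), here is a minimal NumPy sketch; the half-width SBS assumption and the brightness threshold are illustrative, not the exact pipeline settings:

```python
import numpy as np

def split_sbs(frame: np.ndarray):
    """Split a side-by-side (SBS) frame of shape (H, W, 3) into left and right monocular views."""
    h, w, _ = frame.shape
    return frame[:, : w // 2], frame[:, w // 2 :]

def crop_black_borders(frame: np.ndarray, thresh: float = 8.0) -> np.ndarray:
    """Drop near-black letterbox/pillarbox borders by keeping only rows/columns with content."""
    gray = frame.mean(axis=2)
    rows = np.where(gray.max(axis=1) > thresh)[0]
    cols = np.where(gray.max(axis=0) > thresh)[0]
    return frame[rows[0] : rows[-1] + 1, cols[0] : cols[-1] + 1]

# Usage: left, right = split_sbs(sbs_frame)
#        left, right = crop_black_borders(left), crop_black_borders(right)
```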

Experiments

Qualitative Results

Qualitative Results

Qualitative Results. Our method achieves more accurate disparity estimation and preserves finer visual details on both Parallel and Converged data compared with existing baselines.

Quantitative Results

Quantitative comparison on the UniStereo benchmark. "Par." columns are measured on the Stereo4D (parallel-format) split and "Conv." columns on the 3D Movie (converged-format) split.

| Method | Venue | Par. SSIM↑ | Par. MS-SSIM↑ | Par. PSNR↑ | Par. LPIPS↓ | Par. SIOU↑ | Conv. SSIM↑ | Conv. MS-SSIM↑ | Conv. PSNR↑ | Conv. LPIPS↓ | Conv. SIOU↑ | Latency↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| StereoDiffusion | CVPR'24 | 0.642 | 0.711 | 20.541 | 0.245 | 0.252 | 0.678 | 0.612 | 20.695 | 0.341 | 0.181 | 60 min |
| StereoCrafter | arXiv'24 | 0.553 | 0.562 | 17.673 | 0.298 | 0.226 | 0.706 | 0.799 | 23.794 | 0.203 | 0.213 | 1 min |
| SVG | ICLR'25 | 0.561 | 0.543 | 17.971 | 0.368 | 0.220 | 0.653 | 0.553 | 19.059 | 0.426 | 0.166 | 70 min |
| ReCamMaster | ICCV'25 | 0.542 | 0.525 | 17.229 | 0.312 | 0.239 | -- | -- | -- | -- | -- | 15 min |
| M2SVid | 3DV'26 | -- | 0.915 | 26.200 | 0.180 | -- | -- | -- | -- | -- | -- | -- |
| Mono2Stereo | CVPR'25 | 0.649 | 0.721 | 20.894 | 0.222 | 0.241 | 0.795 | 0.810 | 25.756 | 0.191 | 0.201 | 15 min |
| StereoPilot (Ours) | -- | 0.861 | 0.937 | 27.735 | 0.087 | 0.408 | 0.837 | 0.872 | 27.856 | 0.122 | 0.260 | 11 s |
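The exact evaluation protocol is not reproduced here; as a reference point, the SSIM and PSNR columns can be computed per frame with scikit-image as in the sketch below (MS-SSIM, LPIPS, and SIOU require additional models or definitions and are omitted):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_metrics(pred: np.ndarray, gt: np.ndarray):
    """SSIM / PSNR between a predicted and ground-truth view (uint8 arrays of shape H x W x 3)."""
    ssim = structural_similarity(gt, pred, channel_axis=2, data_range=255)
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    return ssim, psnr
```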

In-Domain results on the UniStereo dataset (GT vs. Ours)

Switch between Anaglyph and Side-by-Side (SBS) views.

Out-of-Domain results on the Native 2D Movies dataset (OOD)

Generalization on Native 2D Movies (No Ground Truth). Switch between Anaglyph and Side-by-Side (SBS) views.

Please enable sound for the best experience

Effectiveness of Domain Switcher

Domain Switcher Example 1
Domain Switcher Example 2

Parallel vs. Converged Comparison

Switch between Anaglyph and Side-by-Side (SBS) views.

BibTeX

@misc{shen2025stereopilot,
  title={StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors},
  author={Shen, Guibao and Du, Yihua and Ge, Wenhang and He, Jing and Chang, Chirui and Zhou, Donghao and Yang, Zhen and Wang, Luozhou and Tao, Xin and Chen, Ying-Cong},
  year={2025},
  eprint={2512.16915},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.16915},
}