MultiDiff: Consistent Novel View Synthesis from a Single Image

Abstract

We introduce MultiDiff, a novel approach for consistent novel view synthesis of scenes from a single RGB image. The task of synthesizing novel views from a single reference image is highly ill-posed by nature, as there exist multiple plausible explanations for unobserved areas. To address this issue, we incorporate strong priors in the form of monocular depth predictors and video-diffusion models. Monocular depth enables us to condition our model on warped reference images for the target views, increasing geometric stability. The video-diffusion prior provides a strong proxy for 3D scenes, allowing the model to learn continuous and pixel-accurate correspondences across generated images. In contrast to approaches relying on autoregressive image generation, which are prone to drift and error accumulation, MultiDiff jointly synthesizes a sequence of frames, yielding high-quality and multi-view consistent results, even for long-term scene generation with large camera movements, while reducing inference time by an order of magnitude. For additional consistency and image quality improvements, we introduce a novel structured noise distribution. Our experimental results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging real-world datasets RealEstate10K and ScanNet. Finally, our model naturally supports multi-view consistent editing without the need for further tuning.
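
The abstract mentions conditioning on reference images warped into the target views using monocular depth. The following is a minimal sketch of generic depth-based forward warping, not the authors' implementation: it assumes a pinhole camera with intrinsics `K`, a rigid transform `T_ref_to_tgt` from the reference to the target camera frame, and nearest-pixel splatting with a z-buffer. The function name `warp_reference` and all parameter names are hypothetical and chosen for illustration only.

```python
import numpy as np

def warp_reference(image, depth, K, T_ref_to_tgt):
    """Forward-warp a reference image into a target view (illustrative sketch).

    image:        (H, W, C) float array, reference colors
    depth:        (H, W) positive per-pixel depth, e.g. from a monocular predictor
    K:            (3, 3) pinhole intrinsics (assumed shared by both views)
    T_ref_to_tgt: (4, 4) rigid transform, reference frame -> target frame
    Returns the warped image (H, W, C) and a boolean mask (H, W) of filled pixels.
    """
    H, W = depth.shape
    C = image.shape[-1]

    # Homogeneous pixel coordinates of the reference view, shape (3, H*W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)]).astype(np.float64)

    # Unproject to 3D in the reference frame, then move to the target frame.
    pts_ref = (np.linalg.inv(K) @ pix) * depth.ravel()
    pts_tgt = T_ref_to_tgt[:3, :3] @ pts_ref + T_ref_to_tgt[:3, 3:4]
    z = pts_tgt[2]

    # Project points that lie in front of the target camera to integer pixels.
    front = z > 1e-6
    proj = K @ pts_tgt
    x = np.full(H * W, -1, dtype=np.int64)
    y = np.full(H * W, -1, dtype=np.int64)
    x[front] = np.round(proj[0, front] / z[front]).astype(np.int64)
    y[front] = np.round(proj[1, front] / z[front]).astype(np.int64)
    inside = front & (x >= 0) & (x < W) & (y >= 0) & (y < H)

    # Z-buffer: for every target pixel, keep only the nearest source point.
    src = np.flatnonzero(inside)
    tgt = y[src] * W + x[src]
    zbuf = np.full(H * W, np.inf)
    np.minimum.at(zbuf, tgt, z[src])
    nearest = z[src] <= zbuf[tgt]
    src, tgt = src[nearest], tgt[nearest]

    warped = np.zeros((H * W, C), dtype=image.dtype)
    mask = np.zeros(H * W, dtype=bool)
    warped[tgt] = image.reshape(-1, C)[src]
    mask[tgt] = True
    return warped.reshape(H, W, C), mask.reshape(H, W)
```

Pixels where the returned mask is `False` are disocclusions or regions outside the reference field of view, which is the content a generative model would have to synthesize.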

Cite

Text

Müller et al. "MultiDiff: Consistent Novel View Synthesis from a Single Image." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.00977

Markdown

[Müller et al. "MultiDiff: Consistent Novel View Synthesis from a Single Image." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/muller2024cvpr-multidiff/) doi:10.1109/CVPR52733.2024.00977

BibTeX

@inproceedings{muller2024cvpr-multidiff,
  title     = {{MultiDiff: Consistent Novel View Synthesis from a Single Image}},
  author    = {Müller, Norman and Schwarz, Katja and Rössle, Barbara and Porzi, Lorenzo and Bulò, Samuel Rota and Nießner, Matthias and Kontschieder, Peter},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {10258-10268},
  doi       = {10.1109/CVPR52733.2024.00977},
  url       = {https://mlanthology.org/cvpr/2024/muller2024cvpr-multidiff/}
}