MVD-Fusion: Single-View 3D via Depth-Consistent Multi-View Generation

Abstract

We present MVD-Fusion: a method for single-view 3D inference via generative modeling of multi-view-consistent RGB-D images. While recent methods pursuing 3D inference advocate learning novel-view generative models, these generations are not 3D-consistent and require a distillation process to generate a 3D output. We instead cast the task of 3D inference as directly generating mutually-consistent multiple views and build on the insight that additionally inferring depth can provide a mechanism for enforcing this consistency. Specifically, we train a denoising diffusion model to generate multi-view RGB-D images given a single RGB input image and leverage the (intermediate noisy) depth estimates to obtain reprojection-based conditioning to maintain multi-view consistency. We train our model using the large-scale synthetic dataset Objaverse as well as the real-world CO3D dataset comprising generic camera viewpoints. We demonstrate that our approach can yield more accurate synthesis compared to recent state-of-the-art methods, including distillation-based 3D inference and prior multi-view generation methods. We also evaluate the geometry induced by our multi-view depth prediction and find that it yields a more accurate representation than other direct 3D inference approaches.
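
The reprojection idea mentioned in the abstract rests on a standard geometric step: a pixel with a depth value in one view can be lifted to 3D and projected into another view. The snippet below is a minimal sketch of that step only, assuming a pinhole camera with intrinsics shared across views and a known relative pose. It is not the authors' implementation; the function name `reproject`, the matrix names `K` and `T_src_to_tgt`, and the example values are all hypothetical, and the way the paper turns such correspondences into conditioning for the diffusion model is not shown here.

```python
import numpy as np

def reproject(uv, depth, K, T_src_to_tgt):
    """Map source-view pixels with depth into a target view (pinhole model).

    uv:           (N, 2) pixel coordinates in the source view
    depth:        (N,)   depth along the source camera z-axis
    K:            (3, 3) intrinsics, assumed shared by both views
    T_src_to_tgt: (4, 4) rigid transform from source to target camera frame
    Returns target-view pixel coordinates (N, 2) and target-view depths (N,).
    """
    ones = np.ones((uv.shape[0], 1))
    pix_h = np.hstack([uv, ones])                # homogeneous pixels, (N, 3)
    rays = pix_h @ np.linalg.inv(K).T            # camera-frame rays with z = 1
    pts_src = rays * depth[:, None]              # unproject to 3D points
    pts_h = np.hstack([pts_src, ones])           # homogeneous 3D points, (N, 4)
    pts_tgt = (pts_h @ T_src_to_tgt.T)[:, :3]    # move into target camera frame
    proj = pts_tgt @ K.T                         # project with the intrinsics
    return proj[:, :2] / proj[:, 2:3], pts_tgt[:, 2]

# Hypothetical example values: focal length 100 px, principal point (32, 32).
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
uv = np.array([[32.0, 32.0]])
depth = np.array([2.0])

# A point 2 units in front of the source camera, seen after a 0.5-unit
# shift along x between the two camera frames.
T = np.eye(4)
T[0, 3] = 0.5
uv_tgt, z_tgt = reproject(uv, depth, K, T)
```

With an identity transform the function returns the input pixels and depths unchanged, which is a quick sanity check on the conventions used.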

Cite

Text

Hu et al. "MVD-Fusion: Single-View 3D via Depth-Consistent Multi-View Generation." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.00926

Markdown

[Hu et al. "MVD-Fusion: Single-View 3D via Depth-Consistent Multi-View Generation." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/hu2024cvpr-mvdfusion/) doi:10.1109/CVPR52733.2024.00926

BibTeX

@inproceedings{hu2024cvpr-mvdfusion,
  title     = {{MVD-Fusion: Single-View 3D via Depth-Consistent Multi-View Generation}},
  author    = {Hu, Hanzhe and Zhou, Zhizhuo and Jampani, Varun and Tulsiani, Shubham},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {9698-9707},
  doi       = {10.1109/CVPR52733.2024.00926},
  url       = {https://mlanthology.org/cvpr/2024/hu2024cvpr-mvdfusion/}
}