World-Consistent Video Diffusion with Explicit 3D Modeling

Abstract

Recent advancements in diffusion models have set new benchmarks in image and video generation, enabling realistic visual synthesis across single- and multi-frame contexts. However, these models still struggle with efficiently and explicitly generating 3D-consistent content. To address this, we propose World-consistent Video Diffusion (WVD), a novel framework that incorporates explicit 3D supervision using XYZ images, which encode global 3D coordinates for each image pixel. More specifically, we train a diffusion transformer to learn the joint distribution of RGB and XYZ frames. This approach supports multi-task adaptability via a flexible inpainting strategy. For example, WVD can estimate XYZ frames from ground-truth RGB or generate novel RGB frames using XYZ projections along a specified camera trajectory. In doing so, WVD unifies tasks like single-image-to-3D generation, multi-view stereo, and camera-controlled video generation. Our approach demonstrates competitive performance across multiple benchmarks, providing a scalable solution for 3D-consistent video and image generation with a single pretrained model.
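To make the notion of an XYZ image concrete, the following is a minimal sketch of one way such an image could be built from a depth map and a known camera. The abstract does not say how XYZ images are computed, so the pinhole camera model, the function name `depth_to_xyz_image`, and the parameters `fx`, `fy`, `cx`, `cy`, and `cam_to_world` are all illustrative assumptions rather than the authors' method.

```python
from typing import List, Sequence

# Hypothetical helper: not taken from the paper. Assumes a pinhole camera
# with focal lengths (fx, fy) and principal point (cx, cy), and a 3x4
# camera-to-world matrix [R | t] given as three rows of four numbers.
def depth_to_xyz_image(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    cam_to_world: Sequence[Sequence[float]],
) -> List[List[List[float]]]:
    """Return an H x W x 3 nested list of world-space coordinates per pixel."""
    xyz_image = []
    for v, depth_row in enumerate(depth):
        xyz_row = []
        for u, d in enumerate(depth_row):
            # Back-project pixel (u, v) with depth d into camera coordinates.
            x_cam = (u - cx) / fx * d
            y_cam = (v - cy) / fy * d
            z_cam = d
            # Apply the rigid transform so every frame shares one global frame.
            world_point = [
                row[0] * x_cam + row[1] * y_cam + row[2] * z_cam + row[3]
                for row in cam_to_world
            ]
            xyz_row.append(world_point)
        xyz_image.append(xyz_row)
    return xyz_image


# Toy 2x2 depth map seen by a camera at the world origin (identity pose).
identity_pose = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
toy_depth = [[2.0, 2.0], [2.0, 2.0]]
toy_xyz = depth_to_xyz_image(toy_depth, 1.0, 1.0, 0.0, 0.0, identity_pose)
```

Because every pixel stores a coordinate in a single shared world frame, the same 3D point should receive the same XYZ value in every frame that observes it, which is what makes the representation usable as explicit 3D supervision alongside RGB.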

Cite

Text

Zhang et al. "World-Consistent Video Diffusion with Explicit 3D Modeling." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02020

Markdown

[Zhang et al. "World-Consistent Video Diffusion with Explicit 3D Modeling." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/zhang2025cvpr-worldconsistent/) doi:10.1109/CVPR52734.2025.02020

BibTeX

@inproceedings{zhang2025cvpr-worldconsistent,
  title     = {{World-Consistent Video Diffusion with Explicit 3D Modeling}},
  author    = {Zhang, Qihang and Zhai, Shuangfei and Martin, Miguel Ángel Bautista and Miao, Kevin and Toshev, Alexander and Susskind, Joshua and Gu, Jiatao},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {21685--21695},
  doi       = {10.1109/CVPR52734.2025.02020},
  url       = {https://mlanthology.org/cvpr/2025/zhang2025cvpr-worldconsistent/}
}