Vivid-ZOO: Multi-View Video Generation with Diffusion Model

Abstract

While diffusion models have shown impressive performance in 2D image/video generation, diffusion-based Text-to-Multi-view-Video (T2MVid) generation remains underexplored. The new challenges posed by T2MVid generation lie in the lack of massive captioned multi-view videos and the complexity of modeling such a multi-dimensional distribution. To this end, we propose a novel diffusion-based pipeline that generates high-quality multi-view videos centered around a dynamic 3D object from text. Specifically, we factor the T2MVid problem into viewpoint-space and time components. Such factorization allows us to combine and reuse layers of advanced pre-trained multi-view image and 2D video diffusion models to ensure multi-view consistency as well as temporal coherence for the generated multi-view videos, which largely reduces the training cost. We further introduce alignment modules to align the latent spaces of layers from the pre-trained multi-view and the 2D video diffusion models, addressing the incompatibility of the reused layers that arises from the domain gap between 2D and multi-view data. In support of this and future research, we also contribute a captioned multi-view video dataset. Experimental results demonstrate that, given a variety of text prompts, our method generates high-quality multi-view videos that exhibit vivid motions, temporal coherence, and multi-view consistency.
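The factorization described in the abstract can be illustrated with a toy sketch. It is not the authors' architecture: the function names (`multiview_layer`, `temporal_layer`, `align`, `factorized_block`), the averaging used as a stand-in for pre-trained layers, and the scalar affine "alignment" are all assumptions made for illustration. The only idea taken from the abstract is the ordering: a view-mixing step, an alignment step into the video layer's latent space, a time-mixing step, and an alignment step back.

```python
from typing import List

# Toy latent layout (an assumption): latents[view][frame][channel]
Latents = List[List[List[float]]]


def multiview_layer(x: Latents) -> Latents:
    """Stand-in for a reused multi-view image layer.

    Mixes information across views, independently for each frame,
    by blending every view with the per-frame mean over views.
    """
    n_views, n_frames, n_ch = len(x), len(x[0]), len(x[0][0])
    out = []
    for v in range(n_views):
        frames = []
        for t in range(n_frames):
            mean = [sum(x[u][t][c] for u in range(n_views)) / n_views
                    for c in range(n_ch)]
            frames.append([0.5 * x[v][t][c] + 0.5 * mean[c]
                           for c in range(n_ch)])
        out.append(frames)
    return out


def temporal_layer(x: Latents) -> Latents:
    """Stand-in for a reused 2D video temporal layer.

    Mixes information across frames, independently for each view,
    by blending every frame with the per-view mean over frames.
    """
    n_frames, n_ch = len(x[0]), len(x[0][0])
    out = []
    for view in x:
        mean = [sum(view[s][c] for s in range(n_frames)) / n_frames
                for c in range(n_ch)]
        out.append([[0.5 * view[t][c] + 0.5 * mean[c] for c in range(n_ch)]
                    for t in range(n_frames)])
    return out


def align(x: Latents, scale: float, shift: float) -> Latents:
    """Hypothetical alignment module: an element-wise affine map."""
    return [[[scale * value + shift for value in frame] for frame in view]
            for view in x]


def factorized_block(x: Latents,
                     scale_in: float = 1.0, shift_in: float = 0.0,
                     scale_out: float = 1.0, shift_out: float = 0.0) -> Latents:
    """Viewpoint-space step, then time step, with alignment around the latter."""
    h = multiview_layer(x)              # multi-view consistency
    h = align(h, scale_in, shift_in)    # into the video layer's latent space
    h = temporal_layer(h)               # temporal coherence
    h = align(h, scale_out, shift_out)  # back to the multi-view latent space
    return h


if __name__ == "__main__":
    # 2 views, 1 frame, 1 channel: the view-mixing step pulls views together.
    print(factorized_block([[[0.0]], [[2.0]]]))
```

In this toy version the alignment parameters would be the only quantities to learn, with the two mixing layers held fixed, which mirrors the abstract's motivation of reusing pre-trained layers to reduce training cost. Whether the actual method freezes the reused layers is not stated in the abstract.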

Cite

Text

Li et al. "Vivid-ZOO: Multi-View Video Generation with Diffusion Model." Neural Information Processing Systems, 2024. doi:10.52202/079017-1987

Markdown

[Li et al. "Vivid-ZOO: Multi-View Video Generation with Diffusion Model." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/li2024neurips-vividzoo/) doi:10.52202/079017-1987

BibTeX

@inproceedings{li2024neurips-vividzoo,
  title     = {{Vivid-ZOO: Multi-View Video Generation with Diffusion Model}},
  author    = {Li, Bing and Zheng, Cheng and Zhu, Wenxuan and Mai, Jinjie and Zhang, Biao and Wonka, Peter and Ghanem, Bernard},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-1987},
  url       = {https://mlanthology.org/neurips/2024/li2024neurips-vividzoo/}
}