Factorizing Text-to-Video Generation by Explicit Image Conditioning

Abstract

We present EMU VIDEO, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image. We identify critical design decisions (adjusted noise schedules for diffusion and multi-stage training) that enable us to directly generate high-quality and high-resolution videos without requiring a deep cascade of models as in prior work. In human evaluations, our generated videos are strongly preferred in quality over all prior work: 81% vs. Google's Imagen Video, 90% vs. Nvidia's PYoCo, and 96% vs. Meta's Make-A-Video. Our model also outperforms commercial solutions such as RunwayML's Gen2 and Pika Labs. Finally, our factorized approach naturally lends itself to animating images based on a user's text prompt, where our generations are preferred 96% over prior work.
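The sketch below is only an illustration of the two-step factorization described in the abstract, not the authors' implementation or API: the function names, signatures, and the `Frame` placeholder are hypothetical stand-ins for a text-to-image model and an image-and-text-conditioned video model.

```python
from typing import Any, List

Frame = Any  # placeholder for whatever image/frame representation the models use


def generate_image(prompt: str) -> Frame:
    """Hypothetical text-to-image model (first factorized step)."""
    raise NotImplementedError


def generate_video(prompt: str, first_frame: Frame, num_frames: int = 16) -> List[Frame]:
    """Hypothetical video model conditioned on the text and the generated image
    (second factorized step)."""
    raise NotImplementedError


def factorized_text_to_video(prompt: str) -> List[Frame]:
    # Step 1: generate an image conditioned on the text.
    image = generate_image(prompt)
    # Step 2: generate the video conditioned on both the text and that image.
    return generate_video(prompt, image)
```

Because the video model is explicitly conditioned on a starting image, the same second step can also animate a user-provided image from a text prompt, which is the image-animation use case mentioned at the end of the abstract.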

Cite

Text

Girdhar et al. "Factorizing Text-to-Video Generation by Explicit Image Conditioning." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73033-7_12

Markdown

[Girdhar et al. "Factorizing Text-to-Video Generation by Explicit Image Conditioning." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/girdhar2024eccv-factorizing/) doi:10.1007/978-3-031-73033-7_12

BibTeX

@inproceedings{girdhar2024eccv-factorizing,
  title     = {{Factorizing Text-to-Video Generation by Explicit Image Conditioning}},
  author    = {Girdhar, Rohit and Singh, Mannat and Brown, Andrew and Duval, Quentin and Azadi, Samaneh and Rambhatla, Sai Saketh and Shah, Mian Akbar and Yin, Xi and Parikh, Devi and Misra, Ishan},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73033-7_12},
  url       = {https://mlanthology.org/eccv/2024/girdhar2024eccv-factorizing/}
}