Factorizing Text-to-Video Generation by Explicit Image Conditioning

Abstract

We present EMU VIDEO, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image. We identify critical design decisions (adjusted noise schedules for diffusion and multi-stage training) that enable us to directly generate high-quality and high-resolution videos without requiring a deep cascade of models as in prior work. In human evaluations, our generated videos are strongly preferred in quality over all prior work: 81% vs. Google's Imagen Video, 90% vs. Nvidia's PYoCo, and 96% vs. Meta's Make-A-Video. Our model also outperforms commercial solutions such as RunwayML's Gen2 and Pika Labs. Finally, our factorized approach naturally lends itself to animating images based on a user's text prompt, where our generations are preferred 96% over prior work.
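The sketch below is only an illustration of the two-step factorization described in the abstract, not the authors' implementation or API: the function names, signatures, and the `Frame` placeholder are hypothetical stand-ins for a text-to-image model and an image-and-text-conditioned video model.

```python
from typing import Any, List

Frame = Any  # placeholder for whatever image/frame representation the models use


def generate_image(prompt: str) -> Frame:
    """Hypothetical text-to-image model (first factorized step)."""
    raise NotImplementedError


def generate_video(prompt: str, first_frame: Frame, num_frames: int = 16) -> List[Frame]:
    """Hypothetical video model conditioned on the text and the generated image
    (second factorized step)."""
    raise NotImplementedError


def factorized_text_to_video(prompt: str) -> List[Frame]:
    # Step 1: generate an image conditioned on the text.
    image = generate_image(prompt)
    # Step 2: generate the video conditioned on both the text and that image.
    return generate_video(prompt, image)
```

Because the video model is explicitly conditioned on a starting image, the same second step can also animate a user-provided image from a text prompt, which is the image-animation use case mentioned at the end of the abstract.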

Cite

Text

Girdhar et al. "Factorizing Text-to-Video Generation by Explicit Image Conditioning." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73033-7_12

Markdown

[Girdhar et al. "Factorizing Text-to-Video Generation by Explicit Image Conditioning." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/girdhar2024eccv-factorizing/) doi:10.1007/978-3-031-73033-7_12

BibTeX

@inproceedings{girdhar2024eccv-factorizing,
  title     = {{Factorizing Text-to-Video Generation by Explicit Image Conditioning}},
  author    = {Girdhar, Rohit and Singh, Mannat and Brown, Andrew and Duval, Quentin and Azadi, Samaneh and Rambhatla, Sai Saketh and Shah, Mian Akbar and Yin, Xi and Parikh, Devi and Misra, Ishan},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73033-7_12},
  url       = {https://mlanthology.org/eccv/2024/girdhar2024eccv-factorizing/}
}