HARIVO: Harnessing Text-to-Image Models for Video Generation

Abstract

We present a method to create diffusion-based video models from pretrained Text-to-Image (T2I) models. Recently, AnimateDiff proposed freezing the T2I model while only training temporal layers. We advance this method by proposing a unique architecture, incorporating a mapping network and frame-wise tokens, tailored for video generation while maintaining the diversity and creativity of the original T2I model. Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique, ensuring realistic and temporally consistent video generation despite limited public video data. We have successfully integrated video-specific inductive biases into the architecture and loss functions. Our method, built on the frozen StableDiffusion model, simplifies training processes and allows for seamless integration with off-the-shelf models like ControlNet and DreamBooth. Project page: https://kwonminki.github.io/HARIVO/
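To make the frozen-backbone idea in the abstract concrete, here is a minimal PyTorch-style sketch of the general training setup it describes: the pretrained T2I UNet is frozen, only newly added temporal layers (and, in HARIVO's case, a mapping network) receive gradients, and a temporal-smoothness penalty is added to the usual diffusion objective. All names here (`TemporalLayer`, `temporal_smoothness_loss`, `build_trainable_params`) are illustrative assumptions, and the smoothness term is a generic frame-difference penalty, not the paper's specific loss or architecture.

```python
# Illustrative sketch only -- not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalLayer(nn.Module):
    """Toy temporal attention block mixing information across frames."""

    def __init__(self, channels, num_heads=4):
        super().__init__()
        # channels must be divisible by num_heads
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):
        # x: (batch, frames, channels) -- features of one spatial location over time
        h = self.norm(x)
        h, _ = self.attn(h, h, h)
        return x + h  # residual, so frozen spatial features pass through unchanged


def temporal_smoothness_loss(frames):
    """Generic penalty on frame-to-frame differences; the paper's losses differ."""
    # frames: (batch, num_frames, ...) predicted video frames or latents
    return F.mse_loss(frames[:, 1:], frames[:, :-1])


def build_trainable_params(t2i_unet, temporal_layers, mapping_net):
    """Freeze the pretrained T2I backbone; only the new modules get gradients."""
    for p in t2i_unet.parameters():
        p.requires_grad_(False)
    return list(temporal_layers.parameters()) + list(mapping_net.parameters())


# Usage sketch (hypothetical names):
#   params = build_trainable_params(unet, temporal_layers, mapping_net)
#   optimizer = torch.optim.AdamW(params, lr=1e-4)
#   loss = diffusion_loss + lam * temporal_smoothness_loss(predicted_frames)
```

Because the backbone stays frozen, such a setup keeps the original T2I weights intact, which is what allows drop-in use of off-the-shelf components like ControlNet and DreamBooth mentioned in the abstract.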

Cite

Text

Kwon et al. "HARIVO: Harnessing Text-to-Image Models for Video Generation." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73668-1_2

Markdown

[Kwon et al. "HARIVO: Harnessing Text-to-Image Models for Video Generation." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/kwon2024eccv-harivo/) doi:10.1007/978-3-031-73668-1_2

BibTeX

@inproceedings{kwon2024eccv-harivo,
  title     = {{HARIVO: Harnessing Text-to-Image Models for Video Generation}},
  author    = {Kwon, Mingi and Oh, Seoung Wug and Zhou, Yang and Lee, Joon-Young and Liu, Difan and Cai, Haoran and Liu, Baqiao and Liu, Feng and Uh, Youngjung},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73668-1_2},
  url       = {https://mlanthology.org/eccv/2024/kwon2024eccv-harivo/}
}