Photorealistic Video Generation with Diffusion Models

Abstract

We present W.A.L.T, a diffusion transformer for photorealistic video generation from text prompts. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together, these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier-free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation, consisting of a base latent video diffusion model and two video super-resolution diffusion models, to generate videos of 512 × 896 resolution at 8 frames per second.
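To make the window-attention idea in the abstract concrete, below is a minimal sketch (not the authors' code) of window-restricted self-attention over a video latent: tokens are partitioned into non-overlapping windows and full attention is computed independently within each window. A window spanning a single frame gives spatial attention; a window spanning the time axis gives spatiotemporal attention. The class name, window sizes, and latent shapes are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn


class WindowAttention(nn.Module):
    """Self-attention restricted to non-overlapping (wt, wh, ww) windows.

    Illustrative sketch only; hyperparameters do not reflect the paper.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, window: tuple) -> torch.Tensor:
        # x: (B, T, H, W, C) video latents; window: (wt, wh, ww) dividing (T, H, W).
        B, T, H, W, C = x.shape
        wt, wh, ww = window
        # Partition tokens into non-overlapping windows of wt * wh * ww tokens each.
        x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
        x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)
        # Full attention, but only among tokens inside the same window.
        x, _ = self.attn(x, x, x)
        # Undo the partition back to (B, T, H, W, C).
        x = x.reshape(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
        x = x.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)
        return x


if __name__ == "__main__":
    latents = torch.randn(2, 4, 16, 16, 64)            # (B, T, H, W, C) latents
    layer = WindowAttention(dim=64)
    spatial = layer(latents, window=(1, 8, 8))          # per-frame (spatial) windows
    spatiotemporal = layer(latents, window=(4, 8, 8))   # windows spanning all frames
    print(spatial.shape, spatiotemporal.shape)
```

Because attention cost grows quadratically only in the window size rather than in the full token count, alternating spatial and spatiotemporal windows keeps memory manageable while still mixing information across frames.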

Cite

Text

Gupta et al. "Photorealistic Video Generation with Diffusion Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72986-7_23

Markdown

[Gupta et al. "Photorealistic Video Generation with Diffusion Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/gupta2024eccv-photorealistic/) doi:10.1007/978-3-031-72986-7_23

BibTeX

@inproceedings{gupta2024eccv-photorealistic,
  title     = {{Photorealistic Video Generation with Diffusion Models}},
  author    = {Gupta, Agrim and Yu, Lijun and Sohn, Kihyuk and Gu, Xiuye and Hahn, Meera and Fei-Fei, Li and Essa, Irfan and Jiang, Lu and Lezama, Jose},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72986-7_23},
  url       = {https://mlanthology.org/eccv/2024/gupta2024eccv-photorealistic/}
}