Generating Long Videos of Dynamic Scenes

Abstract

We present a video generation model that accurately reproduces object motion, changes in camera viewpoint, and new content that arises over time. Existing video generation methods often fail to produce new content as a function of time while maintaining consistencies expected in real environments, such as plausible dynamics and object persistence. A common failure case is for content to never change due to over-reliance on inductive bias to provide temporal consistency, such as a single latent code that dictates content for the entire video. On the other extreme, without long-term consistency, generated videos may morph unrealistically between different scenes. To address these limitations, we prioritize the time axis by redesigning the temporal latent representation and learning long-term consistency from data by training on longer videos. We leverage a two-phase training strategy, where we separately train using longer videos at a low resolution and shorter videos at a high resolution. To evaluate the capabilities of our model, we introduce two new benchmark datasets with explicit focus on long-term temporal dynamics.
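Below is a minimal, hypothetical sketch of the two-phase training split the abstract describes: one network is trained on long clips at low spatial resolution, and a separate super-resolution network is trained on short clips at high resolution. All module names, shapes, and the reconstruction loss here are illustrative placeholders for exposition, not the authors' architecture or training objective.

```python
# Illustrative sketch only: the real model uses a redesigned temporal latent
# representation and adversarial training; this toy version just shows the
# "long/low-res" vs. "short/high-res" phase split at a structural level.
import torch
import torch.nn as nn

class LowResVideoNet(nn.Module):
    """Toy stand-in for the network trained on long, low-resolution videos."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, video):  # video: (B, C, T, H, W)
        return self.net(video)

class SuperResVideoNet(nn.Module):
    """Toy stand-in for the network trained on short, high-resolution videos."""
    def __init__(self, channels=3, scale=4):
        super().__init__()
        self.up = nn.Upsample(scale_factor=(1, scale, scale), mode="trilinear",
                              align_corners=False)
        self.refine = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, low_res_video):
        return self.refine(self.up(low_res_video))

def train_phase(model, sample_batch, steps=2):
    """One illustrative training loop; placeholder MSE loss, not the paper's."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        inputs, targets = sample_batch()
        loss = loss_fn(model(inputs), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Phase 1: long clips (many frames) at low spatial resolution.
long_low_res = lambda: (torch.randn(1, 3, 128, 36, 64),
                        torch.randn(1, 3, 128, 36, 64))
train_phase(LowResVideoNet(), long_low_res)

# Phase 2: short clips (few frames) upsampled to high spatial resolution.
short_high_res = lambda: (torch.randn(1, 3, 8, 36, 64),
                          torch.randn(1, 3, 8, 144, 256))
train_phase(SuperResVideoNet(scale=4), short_high_res)
```

The phase split reflects a memory trade-off: long sequences are affordable only at reduced resolution, while high-resolution detail can be learned from short clips.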

Cite

Text

Brooks et al. "Generating Long Videos of Dynamic Scenes." Neural Information Processing Systems, 2022.

Markdown

[Brooks et al. "Generating Long Videos of Dynamic Scenes." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/brooks2022neurips-generating/)

BibTeX

@inproceedings{brooks2022neurips-generating,
  title     = {{Generating Long Videos of Dynamic Scenes}},
  author    = {Brooks, Tim and Hellsten, Janne and Aittala, Miika and Wang, Ting-Chun and Aila, Timo and Lehtinen, Jaakko and Liu, Ming-Yu and Efros, Alexei and Karras, Tero},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/brooks2022neurips-generating/}
}