Video Prediction by Modeling Videos as Continuous Multi-Dimensional Processes

Abstract

Diffusion models have made significant strides in image generation, mastering tasks such as unconditional image synthesis, text-to-image translation, and image-to-image conversion. However, their capability falls short in the realm of video prediction, mainly because they treat videos as a collection of independent images, relying on external constraints such as temporal attention mechanisms to enforce temporal coherence. In our paper, we introduce a novel model class that treats video as a continuous multi-dimensional process rather than a series of discrete frames. Through extensive experimentation, we establish state-of-the-art performance in video prediction, validated on benchmark datasets including KTH, BAIR, Human3.6M, and UCF101.
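
The abstract states the central idea only at a high level. As a rough, non-authoritative sketch of what treating video as a continuous process can look like, the snippet below assumes a Brownian-bridge-style interpolation between consecutive frames and a small denoising network trained to recover the next frame from a noisy intermediate state; the bridge form, noise schedule, and the `TinyDenoiser` module are illustrative assumptions, not the authors' released method.

```python
# Illustrative sketch only: model two consecutive frames as endpoints of a
# continuous stochastic process (a Brownian bridge is assumed here) and train
# a network to predict the next frame from noisy intermediate states.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Hypothetical stand-in for the denoising network; any image-to-image
    backbone (e.g. a U-Net) could take its place."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels + 1, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, x, s):
        # Broadcast the continuous time s in [0, 1] as an extra input channel.
        s_map = s.view(-1, 1, 1, 1).expand(-1, 1, *x.shape[2:])
        return self.net(torch.cat([x, s_map], dim=1))

def bridge_sample(x0, x1, s, sigma=0.1):
    """Sample an intermediate state of the assumed Brownian bridge between
    consecutive frames x0 and x1 at continuous time s in [0, 1]."""
    s4 = s.view(-1, 1, 1, 1)
    mean = (1 - s4) * x0 + s4 * x1          # linear interpolation of frames
    std = sigma * torch.sqrt(s4 * (1 - s4))  # bridge noise vanishes at endpoints
    return mean + std * torch.randn_like(x0)

def training_step(model, x0, x1, opt):
    """One illustrative step: predict the next frame x1 from a noisy
    intermediate state sampled at a random continuous time."""
    s = torch.rand(x0.shape[0])
    xs = bridge_sample(x0, x1, s)
    loss = nn.functional.mse_loss(model(xs, s), x1)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

if __name__ == "__main__":
    model = TinyDenoiser()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    # Dummy frame pairs; real training would use consecutive video frames.
    x0, x1 = torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32)
    print(training_step(model, x0, x1, opt))
```

One appeal of such a construction is that temporal coherence is built into the process itself, since every intermediate state is anchored to both neighboring frames, rather than imposed by an external temporal attention mechanism as in frame-independent diffusion models.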

Cite

Text

Shrivastava and Shrivastava. "Video Prediction by Modeling Videos as Continuous Multi-Dimensional Processes." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.00691

Markdown

[Shrivastava and Shrivastava. "Video Prediction by Modeling Videos as Continuous Multi-Dimensional Processes." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/shrivastava2024cvpr-video/) doi:10.1109/CVPR52733.2024.00691

BibTeX

@inproceedings{shrivastava2024cvpr-video,
  title     = {{Video Prediction by Modeling Videos as Continuous Multi-Dimensional Processes}},
  author    = {Shrivastava, Gaurav and Shrivastava, Abhinav},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {7236--7245},
  doi       = {10.1109/CVPR52733.2024.00691},
  url       = {https://mlanthology.org/cvpr/2024/shrivastava2024cvpr-video/}
}