MAGVIT: Masked Generative Video Transformer

Abstract

We introduce the MAsked Generative VIdeo Transformer (MAGVIT) to tackle various video synthesis tasks with a single model. We propose a 3D tokenizer that quantizes a video into spatial-temporal visual tokens, and an embedding method for masked video token modeling that facilitates multi-task learning. Extensive experiments demonstrate the quality, efficiency, and flexibility of MAGVIT: (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best published FVD on three video generation benchmarks, including the challenging Kinetics-600; (ii) MAGVIT outperforms existing methods in inference time, by two orders of magnitude against diffusion models and by 60x against autoregressive models; and (iii) a single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at https://magvit.cs.cmu.edu.
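
The masked video token modeling described above builds on non-autoregressive, MaskGIT-style iterative decoding: every position in the token grid starts masked, a bidirectional transformer refills all positions in parallel, and low-confidence predictions are re-masked according to a shrinking schedule over a handful of refinement steps. The sketch below is a minimal illustration of that loop in plain Python/NumPy under stated assumptions; predict_logits, the cosine schedule, the mask id, and all shapes are hypothetical stand-ins, not MAGVIT's actual interfaces.

import numpy as np

def cosine_schedule(t: float) -> float:
    # Fraction of tokens that remain masked at normalized step t in [0, 1].
    return float(np.cos(0.5 * np.pi * t))

def iterative_decode(predict_logits, num_tokens: int, vocab_size: int,
                     steps: int = 12, mask_id: int = -1, seed: int = 0):
    """Non-autoregressive decoding over a flattened 3D token grid.

    predict_logits(tokens) stands in for a bidirectional transformer that
    returns per-position logits of shape (num_tokens, vocab_size); it is a
    hypothetical callable, not MAGVIT's actual API.
    """
    rng = np.random.default_rng(seed)
    tokens = np.full(num_tokens, mask_id, dtype=np.int64)
    for step in range(steps):
        logits = predict_logits(tokens)
        # Softmax over the vocabulary at every grid position.
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        sampled = np.array([rng.choice(vocab_size, p=p) for p in probs])
        conf = probs[np.arange(num_tokens), sampled]
        # Already-decoded tokens stay fixed; only masked slots are refilled.
        masked = tokens == mask_id
        conf = np.where(masked, conf, np.inf)
        tokens = np.where(masked, sampled, tokens)
        # Re-mask the least confident predictions per the shrinking schedule.
        n_mask = int(cosine_schedule((step + 1) / steps) * num_tokens)
        if n_mask > 0:
            tokens[np.argsort(conf)[:n_mask]] = mask_id
    return tokens

# Toy run: random logits stand in for the transformer.
rng = np.random.default_rng(1)
V, N = 1024, 4 * 16 * 16          # e.g. tokens for a 4x16x16 latent grid
out = iterative_decode(lambda toks: rng.normal(size=(N, V)), N, V)
assert (out >= 0).all()           # every slot decoded, no mask ids remain

Because every step predicts all positions in parallel, the number of transformer forward passes is the fixed step count (here 12) rather than the token count, which is the source of the inference-time advantage over autoregressive decoding claimed in the abstract.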

Cite

Text

Yu et al. "MAGVIT: Masked Generative Video Transformer." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01008

Markdown

[Yu et al. "MAGVIT: Masked Generative Video Transformer." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/yu2023cvpr-magvit/) doi:10.1109/CVPR52729.2023.01008

BibTeX

@inproceedings{yu2023cvpr-magvit,
  title     = {{MAGVIT: Masked Generative Video Transformer}},
  author    = {Yu, Lijun and Cheng, Yong and Sohn, Kihyuk and Lezama, José and Zhang, Han and Chang, Huiwen and Hauptmann, Alexander G. and Yang, Ming-Hsuan and Hao, Yuan and Essa, Irfan and Jiang, Lu},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {10459--10469},
  doi       = {10.1109/CVPR52729.2023.01008},
  url       = {https://mlanthology.org/cvpr/2023/yu2023cvpr-magvit/}
}