High-Quality Joint Image and Video Tokenization with Causal VAE

Abstract

Generative modeling has seen significant advances in image and video synthesis. However, the curse of dimensionality remains a major obstacle, especially for video generation, given its inherently complex and high-dimensional nature. Many existing works rely on low-dimensional latent spaces from pretrained image autoencoders, but this approach overlooks temporal redundancy in videos and often leads to temporally incoherent decoding. To address this issue, we propose a video compression network that reduces the dimensionality of visual data both spatially and temporally. Our model, based on a variational autoencoder, employs causal 3D convolution to handle images and videos jointly. The key contributions of our work include a scale-agnostic encoder that preserves video fidelity, a novel spatio-temporal down/upsampling block for robust long-sequence modeling, and a flow regularization loss for accurate motion decoding. Our approach outperforms competing methods in video quality and compression rate across various datasets. Experimental analyses also highlight its potential as a robust autoencoder for video generation training.
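
The abstract's core mechanism is causal 3D convolution: padding the temporal axis only on the past side, so the output at frame t never depends on future frames and a single image can be processed as a one-frame video. The paper's own code is not shown here; the sketch below is a minimal PyTorch illustration of this general technique, and the class name `CausalConv3d` and its parameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """Illustrative causal 3D convolution (not the paper's code).

    All temporal padding is applied to the past side of the time axis, so
    the output at frame t depends only on frames <= t. A single image
    (T = 1) is therefore handled exactly like the first frame of a video.
    """
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        self.time_pad = kernel_size - 1    # pad past frames only
        self.space_pad = kernel_size // 2  # symmetric spatial padding
        self.conv = nn.Conv3d(in_channels, out_channels, kernel_size, stride=stride)

    def forward(self, x):
        # x: (batch, channels, time, height, width)
        x = F.pad(
            x,
            (self.space_pad, self.space_pad,   # width: left, right
             self.space_pad, self.space_pad,   # height: top, bottom
             self.time_pad, 0),                # time: past only, no future
        )
        return self.conv(x)

# Usage: the same layer processes a single image and a video clip.
conv = CausalConv3d(3, 64)
image = torch.randn(1, 3, 1, 256, 256)    # an image as a 1-frame clip
video = torch.randn(1, 3, 17, 256, 256)   # a 17-frame video
print(conv(image).shape)  # (1, 64, 1, 256, 256)
print(conv(video).shape)  # (1, 64, 17, 256, 256)
```

Because no padding ever comes from the future, the first output frame is computed from the first input frame alone, which is what allows one set of weights to tokenize images and videos jointly.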

Cite

Text

Argaw et al. "High-Quality Joint Image and Video Tokenization with Causal VAE." International Conference on Learning Representations, 2025.

Markdown

[Argaw et al. "High-Quality Joint Image and Video Tokenization with Causal VAE." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/argaw2025iclr-highquality/)

BibTeX

@inproceedings{argaw2025iclr-highquality,
  title     = {{High-Quality Joint Image and Video Tokenization with Causal VAE}},
  author    = {Argaw, Dawit Mureja and Liu, Xian and Zhang, Qinsheng and Chung, Joon Son and Liu, Ming-Yu and Reda, Fitsum},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/argaw2025iclr-highquality/}
}