High-Quality Joint Image and Video Tokenization with Causal VAE
Abstract
Generative modeling has seen significant advances in image and video synthesis. However, the curse of dimensionality remains a major obstacle, especially for video generation given its inherently complex and high-dimensional nature. Many existing works rely on low-dimensional latent spaces from pretrained image autoencoders; this approach, however, overlooks temporal redundancy in videos and often leads to temporally incoherent decoding. To address this issue, we propose a video compression network that reduces the dimensionality of visual data both spatially and temporally. Our model, based on a variational autoencoder, employs causal 3D convolution to handle images and videos jointly. The key contributions of our work are a scale-agnostic encoder that preserves video fidelity, a novel spatio-temporal down/upsampling block for robust long-sequence modeling, and a flow regularization loss for accurate motion decoding. Our approach outperforms competing methods in video quality and compression rate across various datasets. Experimental analyses also highlight its potential as a robust autoencoder for video generation training.
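The causal 3D convolution named in the abstract can be illustrated with a minimal PyTorch sketch (our illustration, not the authors' released code): padding the temporal axis only on the past side makes each output frame depend solely on current and earlier input frames, which is what allows a single image to be processed as a one-frame video under the same operator. The class name CausalConv3d and all parameters below are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """Hypothetical sketch of a causal 3D convolution: temporal padding is
    applied only to the past side, so frame t never sees frames after t."""
    def __init__(self, in_ch, out_ch, kernel_size=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel_size
        self.time_pad = kt - 1                                   # pad entirely with past frames
        self.space_pad = (kw // 2, kw // 2, kh // 2, kh // 2)    # symmetric spatial padding
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size)

    def forward(self, x):
        # x: (batch, channels, time, height, width)
        # F.pad order for 5D input: (w_left, w_right, h_top, h_bottom, t_front, t_back)
        x = F.pad(x, self.space_pad + (self.time_pad, 0))
        return self.conv(x)

# A 1-frame input (an image) and a 16-frame clip pass through the same layer.
layer = CausalConv3d(3, 64)
image = torch.randn(1, 3, 1, 64, 64)   # image treated as a one-frame video
video = torch.randn(1, 3, 16, 64, 64)
print(layer(image).shape)  # torch.Size([1, 64, 1, 64, 64])
print(layer(video).shape)  # torch.Size([1, 64, 16, 64, 64])

Because the first output frame depends only on the first input frame, the same operator handles images and videos uniformly; the paper's full architecture (scale-agnostic encoder, spatio-temporal down/upsampling, flow regularization) is not reproduced here.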
Cite
Text
Argaw et al. "High-Quality Joint Image and Video Tokenization with Causal VAE." International Conference on Learning Representations, 2025.

Markdown
[Argaw et al. "High-Quality Joint Image and Video Tokenization with Causal VAE." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/argaw2025iclr-highquality/)

BibTeX
@inproceedings{argaw2025iclr-highquality,
  title = {{High-Quality Joint Image and Video Tokenization with Causal VAE}},
  author = {Argaw, Dawit Mureja and Liu, Xian and Zhang, Qinsheng and Chung, Joon Son and Liu, Ming-Yu and Reda, Fitsum},
  booktitle = {International Conference on Learning Representations},
  year = {2025},
  url = {https://mlanthology.org/iclr/2025/argaw2025iclr-highquality/}
}