Learnings from Scaling Visual Tokenizers for Reconstruction and Generation

Abstract

Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. However, questions remain about how auto-encoder design impacts reconstruction and downstream generative performance. This work explores scaling in auto-encoders for reconstruction and generation by replacing the convolutional backbone with an enhanced Vision Transformer for Tokenization (ViTok). We find that scaling the auto-encoder bottleneck correlates strongly with reconstruction but exhibits a more nuanced relationship with generation. Separately, scaling the encoder yields no gains, while scaling the decoder improves reconstruction with minimal impact on generation. As a result, we determine that scaling the current paradigm of auto-encoders is not effective for improving generation performance. Coupled with Diffusion Transformers, ViTok achieves competitive image reconstruction and generation performance on 256p and 512p ImageNet-1K. For video, ViTok achieves state-of-the-art reconstruction and generation performance on 16-frame 128p UCF-101.
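
To make the setup concrete, below is a minimal PyTorch sketch of a ViT-style auto-encoder with an explicit per-token bottleneck, the component whose size the abstract relates to reconstruction quality. This is an illustrative assumption-laden sketch, not the authors' ViTok implementation: the class name, widths, depths, and bottleneck dimension are all hypothetical, and the training losses used in the paper are omitted.

```python
import torch
import torch.nn as nn

class ViTAutoencoderSketch(nn.Module):
    """Hypothetical ViT-style auto-encoder: patchify -> Transformer encoder
    -> per-token bottleneck -> Transformer decoder -> unpatchify.
    The bottleneck size (num_tokens x latent_dim) is the scaling knob
    the abstract connects to reconstruction quality."""

    def __init__(self, image_size=256, patch_size=16, width=768,
                 enc_depth=6, dec_depth=6, latent_dim=16):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (image_size // patch_size) ** 2
        patch_dim = 3 * patch_size * patch_size

        self.patch_embed = nn.Linear(patch_dim, width)
        self.pos_emb = nn.Parameter(torch.zeros(1, num_patches, width))

        enc_layer = nn.TransformerEncoderLayer(
            d_model=width, nhead=8, dim_feedforward=4 * width,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=enc_depth)

        # Bottleneck: compress each patch token to latent_dim channels.
        self.to_latent = nn.Linear(width, latent_dim)
        self.from_latent = nn.Linear(latent_dim, width)

        dec_layer = nn.TransformerEncoderLayer(
            d_model=width, nhead=8, dim_feedforward=4 * width,
            batch_first=True, norm_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=dec_depth)
        self.to_pixels = nn.Linear(width, patch_dim)

    def patchify(self, x):
        # (B, C, H, W) -> (B, N, C*p*p)
        b, c, h, w = x.shape
        p = self.patch_size
        x = x.reshape(b, c, h // p, p, w // p, p)
        return x.permute(0, 2, 4, 1, 3, 5).reshape(b, (h // p) * (w // p), c * p * p)

    def unpatchify(self, tokens, h, w):
        # (B, N, C*p*p) -> (B, C, H, W)
        b, n, d = tokens.shape
        p = self.patch_size
        c = d // (p * p)
        x = tokens.reshape(b, h // p, w // p, c, p, p)
        return x.permute(0, 3, 1, 4, 2, 5).reshape(b, c, h, w)

    def forward(self, x):
        h, w = x.shape[-2:]
        tokens = self.patch_embed(self.patchify(x)) + self.pos_emb
        z = self.to_latent(self.encoder(tokens))           # bottleneck latents
        recon_tokens = self.decoder(self.from_latent(z))
        return self.unpatchify(self.to_pixels(recon_tokens), h, w), z

# Usage sketch: latents have shape (2, 256, 16), i.e. the 256x16 bottleneck.
model = ViTAutoencoderSketch()
images = torch.randn(2, 3, 256, 256)
reconstruction, latents = model(images)
```

Under this framing, "scaling the bottleneck" means growing num_tokens x latent_dim (here 256 x 16), while "scaling the encoder/decoder" means growing their widths and depths independently, which is the decomposition the abstract's findings refer to.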

Cite

Text

Hansen-Estruch et al. "Learnings from Scaling Visual Tokenizers for Reconstruction and Generation." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Hansen-Estruch et al. "Learnings from Scaling Visual Tokenizers for Reconstruction and Generation." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/hansenestruch2025icml-learnings/)

BibTeX

@inproceedings{hansenestruch2025icml-learnings,
  title     = {{Learnings from Scaling Visual Tokenizers for Reconstruction and Generation}},
  author    = {Hansen-Estruch, Philippe and Yan, David and Chuang, Ching-Yao and Zohar, Orr and Wang, Jialiang and Hou, Tingbo and Xu, Tao and Vishwanath, Sriram and Vajda, Peter and Chen, Xinlei},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {22023--22043},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/hansenestruch2025icml-learnings/}
}