CAT: Content-Adaptive Image Tokenization

Abstract

Most existing image tokenizers encode images into a fixed number of tokens or patches, overlooking the inherent variability in image complexity and introducing unnecessary computational overhead for simpler images. To address this, we propose Content-Adaptive Tokenizer (CAT), which dynamically adjusts representation capacity based on the image content and encodes simpler images into fewer tokens. We design (1) a caption-based evaluation system that leverages LLMs to predict content complexity and determine the optimal compression ratio for an image, and (2) a novel nested VAE architecture that performs variable-rate compression in a single model. Trained on images with varying complexity, CAT achieves an average 15% reduction in rFID across seven detail-rich datasets containing text, humans, and complex textures. On natural image datasets like ImageNet and COCO, it reduces token usage by 18% while maintaining high-fidelity reconstructions. We further evaluate CAT on two downstream tasks. For image classification, CAT consistently improves top-1 accuracy across five datasets spanning diverse domains. For image generation, it boosts training throughput by 23% on ImageNet, leading to more efficient learning and improved FIDs over fixed-token baselines.
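The abstract couples two components: an LLM-predicted complexity score that selects a per-image compression ratio, and a single nested VAE that can reconstruct from latents of different sizes. Below is a minimal, hypothetical sketch of that pipeline in PyTorch. The names `NestedVAE` and `complexity_to_ratio`, the score thresholds, and the pooling-based realization of the nested latent are all illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of content-adaptive, variable-rate compression in one model.
import torch
import torch.nn as nn
import torch.nn.functional as F

def complexity_to_ratio(score: float) -> int:
    """Map an LLM-predicted complexity score in [0, 1] to a spatial
    compression ratio; simpler images get stronger compression.
    Thresholds here are illustrative, not from the paper."""
    if score < 0.33:
        return 32   # simplest images -> fewest tokens
    if score < 0.66:
        return 16
    return 8        # detail-rich images -> most tokens

class NestedVAE(nn.Module):
    """Toy encoder-decoder whose base latent grid is pooled to a size
    chosen per image, so a single model serves several ratios."""
    def __init__(self, channels: int = 64, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, latent_dim, 4, stride=2, padding=1),  # /8 base grid
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, channels, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(channels, 3, 4, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor, ratio: int) -> torch.Tensor:
        z = self.encoder(x)                         # latent at base ratio 8
        grid = x.shape[-1] // ratio                 # target latent side length
        z = F.adaptive_avg_pool2d(z, grid)          # shrink latent for simple images
        z = F.interpolate(z, size=x.shape[-1] // 8, mode="nearest")
        return self.decoder(z)

model = NestedVAE()
img = torch.randn(1, 3, 256, 256)
ratio = complexity_to_ratio(0.2)                    # e.g., a simple image
recon = model(img, ratio)
print(ratio, recon.shape)                           # 32, torch.Size([1, 3, 256, 256])
```

The key design point the sketch illustrates is that the compression ratio is an input to the forward pass rather than a fixed architectural constant, so token count can vary per image while the encoder and decoder weights are shared across all ratios.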

Cite

Text

Shen et al. "CAT: Content-Adaptive Image Tokenization." Advances in Neural Information Processing Systems, 2025.

Markdown

[Shen et al. "CAT: Content-Adaptive Image Tokenization." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/shen2025neurips-cat/)

BibTeX

@inproceedings{shen2025neurips-cat,
  title     = {{CAT: Content-Adaptive Image Tokenization}},
  author    = {Shen, Junhong and Tirumala, Kushal and Yasunaga, Michihiro and Misra, Ishan and Zettlemoyer, Luke and Yu, Lili and Zhou, Chunting},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/shen2025neurips-cat/}
}