Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Generation
Abstract
In this work, we present a novel direction for building an image tokenizer directly on top of a frozen vision foundation model, a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality while also enhancing token efficiency. It further boosts autoregressive (AR) generation, achieving a gFID of 1.36 on ImageNet benchmarks, accelerating model convergence threefold, and enabling high-fidelity class-conditional synthesis without classifier-free guidance (CFG). The code is available at https://github.com/CVMI-Lab/VFMTok.
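To make the two components concrete, below is a minimal PyTorch sketch of a vector-quantized tokenizer operating on features from a frozen foundation model. All module names, dimensions, and the plain nearest-neighbour quantizer are illustrative assumptions rather than the paper's actual architecture (in particular, the region-adaptive quantization framework is not reproduced here); only the overall pattern of a frozen encoder, discrete codes, and a semantic reconstruction loss follows the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenVFMTokenizerSketch(nn.Module):
    """Toy VQ tokenizer over frozen foundation-model features.

    Dimensions, names, and the plain grid quantizer are illustrative
    assumptions; the paper's region-adaptive quantizer is not reproduced.
    """

    def __init__(self, feat_dim=768, code_dim=32, codebook_size=4096):
        super().__init__()
        self.proj = nn.Linear(feat_dim, code_dim)      # compress VFM features
        self.codebook = nn.Embedding(codebook_size, code_dim)
        self.sem_head = nn.Linear(code_dim, feat_dim)  # semantic reconstruction

    def forward(self, vfm_feats):
        # vfm_feats: (B, N, feat_dim) patch features from a *frozen* encoder,
        # so no gradients flow back into the foundation model.
        z = self.proj(vfm_feats)
        # Squared L2 distance from each feature to every codebook entry: (B, N, K).
        dist = (z.pow(2).sum(-1, keepdim=True)
                - 2 * z @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(-1))
        idx = dist.argmin(dim=-1)                      # discrete token ids
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()                   # straight-through estimator
        # Semantic reconstruction: regress the frozen model's own features,
        # keeping the quantized tokens aligned with its representations.
        sem_loss = F.mse_loss(self.sem_head(z_q), vfm_feats.detach())
        return idx, z_q, sem_loss

# Random features stand in for the output of a frozen vision foundation model.
tok = FrozenVFMTokenizerSketch()
feats = torch.randn(2, 196, 768)       # e.g. a 14x14 patch grid
idx, z_q, sem_loss = tok(feats)
print(idx.shape, sem_loss.item())      # torch.Size([2, 196]) and a scalar
```

The token ids `idx` are what an AR generator would model; the semantic loss term is what distinguishes this setup from a purely pixel-level VQ tokenizer.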
Cite
Text
Zheng et al. "Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Generation." Advances in Neural Information Processing Systems, 2025.
Markdown
[Zheng et al. "Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Generation." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/zheng2025neurips-vision/)
BibTeX
@inproceedings{zheng2025neurips-vision,
  title     = {{Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Generation}},
  author    = {Zheng, Anlin and Wen, Xin and Zhang, Xuanyang and Ma, Chuofan and Wang, Tiancai and Yu, Gang and Zhang, Xiangyu and Qi, Xiaojuan},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/zheng2025neurips-vision/}
}