Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Generation
Abstract
In this work, we present a novel direction for building an image tokenizer directly on top of a frozen vision foundation model, a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality while also enhancing token efficiency. It further boosts autoregressive (AR) generation, achieving a gFID of 1.36 on ImageNet benchmarks, accelerating model convergence threefold, and enabling high-fidelity class-conditional synthesis without classifier-free guidance (CFG). The code is available at https://github.com/CVMI-Lab/VFMTok.
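To make the two components concrete, below is a minimal PyTorch sketch of a vector-quantized tokenizer operating on features from a frozen foundation model. All module names, dimensions, and the plain nearest-neighbour quantizer are illustrative assumptions rather than the paper's actual architecture (in particular, the region-adaptive quantization framework is not reproduced here); only the overall pattern of a frozen encoder, discrete codes, and a semantic reconstruction loss follows the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenVFMTokenizerSketch(nn.Module):
    """Toy VQ tokenizer over frozen foundation-model features.

    Dimensions, names, and the plain grid quantizer are illustrative
    assumptions; the paper's region-adaptive quantizer is not reproduced.
    """

    def __init__(self, feat_dim=768, code_dim=32, codebook_size=4096):
        super().__init__()
        self.proj = nn.Linear(feat_dim, code_dim)      # compress VFM features
        self.codebook = nn.Embedding(codebook_size, code_dim)
        self.sem_head = nn.Linear(code_dim, feat_dim)  # semantic reconstruction

    def forward(self, vfm_feats):
        # vfm_feats: (B, N, feat_dim) patch features from a *frozen* encoder,
        # so no gradients flow back into the foundation model.
        z = self.proj(vfm_feats)
        # Squared L2 distance from each feature to every codebook entry: (B, N, K).
        dist = (z.pow(2).sum(-1, keepdim=True)
                - 2 * z @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(-1))
        idx = dist.argmin(dim=-1)                      # discrete token ids
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()                   # straight-through estimator
        # Semantic reconstruction: regress the frozen model's own features,
        # keeping the quantized tokens aligned with its representations.
        sem_loss = F.mse_loss(self.sem_head(z_q), vfm_feats.detach())
        return idx, z_q, sem_loss

# Random features stand in for the output of a frozen vision foundation model.
tok = FrozenVFMTokenizerSketch()
feats = torch.randn(2, 196, 768)       # e.g. a 14x14 patch grid
idx, z_q, sem_loss = tok(feats)
print(idx.shape, sem_loss.item())      # torch.Size([2, 196]) and a scalar
```

The token ids `idx` are what an AR generator would model; the semantic loss term is what distinguishes this setup from a purely pixel-level VQ tokenizer.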
Cite
Text
Zheng et al. "Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Generation." Advances in Neural Information Processing Systems, 2025.
Markdown
[Zheng et al. "Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Generation." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/zheng2025neurips-vision/)
BibTeX
@inproceedings{zheng2025neurips-vision,
  title     = {{Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Generation}},
  author    = {Zheng, Anlin and Wen, Xin and Zhang, Xuanyang and Ma, Chuofan and Wang, Tiancai and Yu, Gang and Zhang, Xiangyu and Qi, Xiaojuan},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/zheng2025neurips-vision/}
}