Holistic Tokenizer for Autoregressive Image Generation

Abstract

Vanilla autoregressive image generation models generate visual tokens step by step, which limits their ability to capture holistic relationships among token sequences. Moreover, because most visual tokenizers map local image patches into latent tokens, the tokens carry limited global information. To address this, we introduce Hita, a novel image tokenizer for autoregressive (AR) image generation. Hita uses a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens, and incorporates two key strategies to better align with the AR generation process: 1) arranging a sequential structure with holistic tokens at the beginning, followed by patch-level tokens, and using causal attention to maintain awareness of previous tokens; and 2) adopting a lightweight fusion module before feeding the de-quantized tokens into the decoder to control information flow and prioritize holistic tokens. Extensive experiments show that Hita speeds up the training of AR generators, and that the resulting generators outperform those trained with vanilla tokenizers, achieving 2.59 FID and 281.9 IS on the ImageNet benchmark. Detailed analysis of the holistic representation highlights its ability to capture global image properties, such as textures, materials, and shapes. Hita is also effective in zero-shot style transfer and image in-painting. The code is available at https://github.com/CVMI-Lab/Hita.
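
The holistic-to-local ordering and causal attention described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_sequence` and `causal_mask`, the token labels, and the token counts are all hypothetical and chosen only to show the idea that holistic tokens come first and every later patch token can attend to them.

```python
# Illustrative sketch only. Names and sizes are assumptions for this example
# and are not taken from the Hita codebase.


def build_sequence(num_holistic, num_patches):
    """Return token labels with holistic tokens first, then patch tokens."""
    holistic = [f"h{i}" for i in range(num_holistic)]
    patches = [f"p{i}" for i in range(num_patches)]
    return holistic + patches


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


tokens = build_sequence(num_holistic=2, num_patches=4)
mask = causal_mask(len(tokens))

# Show which earlier tokens each position can see under the causal mask.
for token, row in zip(tokens, mask):
    visible = [t for t, allowed in zip(tokens, row) if allowed]
    print(f"{token} attends to {visible}")
```

Because the holistic tokens occupy the first positions, the lower-triangular mask lets every patch token see all of them, while the holistic tokens themselves never see patch tokens. The fusion module mentioned in the abstract is not modeled here.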

Cite

Text

Zheng et al. "Holistic Tokenizer for Autoregressive Image Generation." International Conference on Computer Vision, 2025.

Markdown

[Zheng et al. "Holistic Tokenizer for Autoregressive Image Generation." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/zheng2025iccv-holistic/)

BibTeX

@inproceedings{zheng2025iccv-holistic,
  title     = {{Holistic Tokenizer for Autoregressive Image Generation}},
  author    = {Zheng, Anlin and Wang, Haochen and Zhao, Yucheng and Deng, Weipeng and Wang, Tiancai and Zhang, Xiangyu and Qi, Xiaojuan},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {16916--16926},
  url       = {https://mlanthology.org/iccv/2025/zheng2025iccv-holistic/}
}