A Dual Stream Visual Tokenizer for LLM Image Generation

Abstract

We propose a novel visual tokenizer that combines high-level semantic tokens and low-level pixel tokens to represent images, aiming to address the challenges of image-to-sequence conversion for Large Language Models (LLMs). Existing visual tokenizers, such as VQ-VAE and diffusion-based models, either struggle with token explosion as image resolution increases or fail to capture detailed structural information. Our method introduces a dual-token system: high-level semantic tokens capture the main content of the image, while low-level pixel tokens preserve structural details. By integrating these tokens in a hybrid architecture, we leverage a VQ-VAE branch to generate low-resolution guidance and a diffusion process to reconstruct high-resolution images with both semantic coherence and structural accuracy. This approach significantly reduces the number of required tokens and enhances image reconstruction quality, offering an efficient solution for tasks like image generation and understanding based on LLMs.
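To make the dual-token idea concrete, below is a minimal, illustrative PyTorch sketch of a tokenizer that emits a few high-level semantic tokens alongside a low-resolution grid of pixel tokens. All module names, codebook sizes, and dimensions here are assumptions for illustration only; they are not the authors' implementation, and the diffusion-based high-resolution decoder described in the abstract is omitted.

# Sketch only: hypothetical dual-stream tokenizer, not the paper's code.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbour vector quantization over a learned codebook."""

    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor):
        # z: (..., dim) -> token indices and their quantized embeddings
        flat = z.reshape(-1, z.shape[-1])
        dists = torch.cdist(flat, self.codebook.weight)   # (N, num_codes)
        idx = dists.argmin(dim=-1)
        quantized = self.codebook(idx).reshape(z.shape)
        return idx.reshape(z.shape[:-1]), quantized


class DualStreamTokenizer(nn.Module):
    """High-level semantic tokens plus low-level pixel tokens (illustrative)."""

    def __init__(self, dim: int = 256, sem_tokens: int = 16):
        super().__init__()
        # Pixel branch: VQ-VAE-style conv encoder producing a low-resolution grid.
        self.pixel_enc = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=4, stride=4), nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=4, stride=4),
        )
        # Semantic branch: learned queries pool global content via cross-attention.
        self.sem_queries = nn.Parameter(torch.randn(sem_tokens, dim))
        self.sem_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.pixel_vq = VectorQuantizer(num_codes=8192, dim=dim)
        self.sem_vq = VectorQuantizer(num_codes=1024, dim=dim)

    def forward(self, img: torch.Tensor):
        # img: (B, 3, H, W)
        feat = self.pixel_enc(img)                        # (B, dim, H/16, W/16)
        b, d, h, w = feat.shape
        grid = feat.permute(0, 2, 3, 1).reshape(b, h * w, d)

        pixel_ids, _ = self.pixel_vq(grid)                # low-level pixel tokens
        q = self.sem_queries.expand(b, -1, -1)
        sem_feat, _ = self.sem_attn(q, grid, grid)        # pool image content into queries
        sem_ids, _ = self.sem_vq(sem_feat)                # high-level semantic tokens
        return sem_ids, pixel_ids                         # both streams feed the LLM sequence


if __name__ == "__main__":
    tok = DualStreamTokenizer()
    sem_ids, pixel_ids = tok(torch.randn(1, 3, 256, 256))
    print(sem_ids.shape, pixel_ids.shape)                 # (1, 16) and (1, 256)

In such a scheme, the handful of semantic tokens keeps the LLM sequence short, while the coarse pixel-token grid would serve as low-resolution structural guidance for a separate high-resolution decoder.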

Cite

Text

Li et al. "A Dual Stream Visual Tokenizer for LLM Image Generation." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/167

Markdown

[Li et al. "A Dual Stream Visual Tokenizer for LLM Image Generation." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/li2025ijcai-dual/) doi:10.24963/IJCAI.2025/167

BibTeX

@inproceedings{li2025ijcai-dual,
  title     = {{A Dual Stream Visual Tokenizer for LLM Image Generation}},
  author    = {Li, Yongqian and Luo, Yong and Cai, Xiantao and He, Zheng and Meng, Zhennan and Wang, Nidong and Chen, Yunlin and Li, Zhifei},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {1494--1502},
  doi       = {10.24963/IJCAI.2025/167},
  url       = {https://mlanthology.org/ijcai/2025/li2025ijcai-dual/}
}