Window Token Concatenation for Efficient Visual Large Language Models

Li, Yifan; Bao, Wentao; Ye, Botao; Tan, Zhen; Chen, Tianlong; Liu, Huan; Kong, Yu

CVPRW 2025 pp. 3187-3197

Abstract

To effectively reduce the number of visual tokens in Visual Large Language Models (VLLMs), we propose a novel approach called Window Token Concatenation (WiCo). Specifically, we employ a sliding window to concatenate spatially adjacent visual tokens. However, directly concatenating these tokens may group dissimilar tokens into one and thus obscure fine details. To address this challenge, we propose fine-tuning the last few layers of the vision encoder to adaptively adjust the visual tokens, encouraging tokens within the same window to exhibit similar features. To further enhance performance on fine-grained visual understanding tasks, we introduce WiCo+, which decomposes the visual tokens in the later layers of the LLM. This design benefits from the large perception field of the LLM for fine-grained visual understanding while keeping the number of visual tokens small for efficient inference. We perform extensive experiments on both coarse- and fine-grained visual understanding tasks based on LLaVA-1.5 and Shikra, showing better performance than existing token reduction projectors.
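
The window concatenation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `window_concat`, the row-major token layout, and the choice of non-overlapping windows (stride equal to the window size) are all assumptions made for illustration, since the abstract does not specify them. The sketch only shows how merging a window of adjacent tokens trades a smaller token count for a wider feature dimension; the fine-tuning of the vision encoder and any projection into the LLM embedding space are omitted.

```python
from typing import List

Token = List[float]


def window_concat(tokens: List[Token], grid_h: int, grid_w: int, window: int) -> List[Token]:
    """Merge each window x window patch of tokens into one longer token.

    Assumptions (not taken from the paper): tokens are stored row-major
    over a grid_h x grid_w grid, and windows do not overlap.
    """
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count must equal grid_h * grid_w")
    if grid_h % window or grid_w % window:
        raise ValueError("grid size must be divisible by the window size")

    merged: List[Token] = []
    for top in range(0, grid_h, window):
        for left in range(0, grid_w, window):
            combined: Token = []
            # Concatenate the features of every token inside this window.
            for dy in range(window):
                for dx in range(window):
                    combined.extend(tokens[(top + dy) * grid_w + (left + dx)])
            merged.append(combined)
    return merged


# Toy example: a 4 x 4 grid of 2-dimensional tokens, token i = [i, i + 0.5].
tokens = [[float(i), i + 0.5] for i in range(16)]
merged = window_concat(tokens, grid_h=4, grid_w=4, window=2)
print(len(merged), len(merged[0]))  # 4 8
```

With a 2 x 2 window, the 16 input tokens become 4 tokens, each 4 times as wide. In this toy setting, the first merged token is built from grid positions 0, 1, 4, and 5, which is why tokens in the same window need similar features for the merge to preserve detail.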

Cite

Text

Li et al. "Window Token Concatenation for Efficient Visual Large Language Models." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.

Markdown

[Li et al. "Window Token Concatenation for Efficient Visual Large Language Models." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/li2025cvprw-window/)

BibTeX

@inproceedings{li2025cvprw-window,
  title     = {{Window Token Concatenation for Efficient Visual Large Language Models}},
  author    = {Li, Yifan and Bao, Wentao and Ye, Botao and Tan, Zhen and Chen, Tianlong and Liu, Huan and Kong, Yu},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2025},
  pages     = {3187-3197},
  url       = {https://mlanthology.org/cvprw/2025/li2025cvprw-window/}
}