Token-Efficient VLM: High-Resolution Image Understanding via Dynamic Region Proposal

Abstract

Vision-Language Models (VLMs) excel at visual understanding by leveraging pretrained image encoders to generate visual tokens. However, they struggle with high-resolution images and zoomed-in regions because uniform patch-based processing incurs heavy computation and token redundancy, often losing critical details. To address these challenges, we propose the Token-Efficient Vision-Language Model (TEVA), a novel framework that detects key regions and applies dynamic patch sampling to efficiently capture fine-grained details while preserving global context. Our approach first identifies subject-oriented regions using an adaptive detection strategy. A dynamic patch sampling mechanism then selects and arranges patches at varying scales, ensuring efficient processing without increasing the token count. Extensive experiments demonstrate that TEVA significantly enhances VLM performance in handling visual details and integrates seamlessly with various decoders and LLMs.
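
The abstract describes a two-stage pipeline: locate a subject-oriented region, then spend a fixed patch budget by sampling fine patches inside that region and coarse patches over the full image. The PyTorch sketch below illustrates one way such a scheme could work under stated assumptions; the edge-energy saliency proxy, the function names (propose_region, sample_patches), and the fixed 128-patch budget are illustrative placeholders, not the paper's actual method.

import torch
import torch.nn.functional as F

def propose_region(image: torch.Tensor, grid: int = 8):
    """Crude stand-in for the paper's adaptive region detection:
    pick the grid cell with the highest edge energy.
    image: (3, H, W) in [0, 1]; returns (y, x, h, w)."""
    gray = image.mean(dim=0, keepdim=True)                    # (1, H, W)
    dy = (gray[:, 1:, :] - gray[:, :-1, :]).abs()             # vertical edges
    dx = (gray[:, :, 1:] - gray[:, :, :-1]).abs()             # horizontal edges
    energy = (F.adaptive_avg_pool2d(dy[None], grid)[0, 0]
              + F.adaptive_avg_pool2d(dx[None], grid)[0, 0])  # (grid, grid)
    r, c = divmod(int(energy.flatten().argmax()), grid)
    H, W = image.shape[1:]
    return r * H // grid, c * W // grid, H // grid, W // grid

def sample_patches(image: torch.Tensor, region, patch: int = 16,
                   budget: int = 128) -> torch.Tensor:
    """Split a fixed patch budget: half at full resolution inside the
    proposed region, half from a downscaled view of the whole image,
    so the token count handed to the encoder never grows."""
    side = patch * int((budget // 2) ** 0.5)                  # per-view resolution
    y, x, h, w = region
    fine = F.interpolate(image[None, :, y:y + h, x:x + w], size=(side, side))[0]
    coarse = F.interpolate(image[None], size=(side, side))[0]
    def to_patches(img):  # (3, side, side) -> (budget // 2, 3 * patch * patch)
        return (img.unfold(1, patch, patch).unfold(2, patch, patch)
                   .permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch * patch))
    return torch.cat([to_patches(coarse), to_patches(fine)])[:budget]

image = torch.rand(3, 1024, 1024)               # stand-in high-resolution input
tokens = sample_patches(image, propose_region(image))
print(tokens.shape)                             # torch.Size([128, 768])

In this toy version the fine view covers one cell of an 8x8 grid; the budget stays at 128 patches whether the input is 1024x1024 or larger, which is the token-efficiency property the abstract emphasizes.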

Cite

Text

Jiang et al. "Token-Efficient VLM: High-Resolution Image Understanding via Dynamic Region Proposal." International Conference on Computer Vision, 2025.

Markdown

[Jiang et al. "Token-Efficient VLM: High-Resolution Image Understanding via Dynamic Region Proposal." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/jiang2025iccv-tokenefficient/)

BibTeX

@inproceedings{jiang2025iccv-tokenefficient,
  title     = {{Token-Efficient VLM: High-Resolution Image Understanding via Dynamic Region Proposal}},
  author    = {Jiang, Yitong and Gu, Jinwei and Xue, Tianfan and Cheung, Ka Chun and Molchanov, Pavlo and Yin, Hongxu and Liu, Sifei},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {24147--24158},
  url       = {https://mlanthology.org/iccv/2025/jiang2025iccv-tokenefficient/}
}