VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

Abstract

Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While performance drops significantly on a small subset of OCR-related tasks, models still perform accurately on most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples at different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for problem solving; if not, the model outputs a special token to request the higher-resolution image. Compared to existing efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding capability on OCR-related tasks, while saving substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. All our code and data are open-sourced.
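
As a rough illustration of the two-pass inference described above, here is a minimal Python sketch. It is not the paper's implementation: the `RESIZE_TOKEN` string, the `model.generate(image=..., prompt=...)` interface, and the downsampling factor are all assumptions made for clarity.

```python
from PIL import Image

# Hypothetical marker the model emits when the low-resolution view is
# insufficient; the actual special token used by VisionThink may differ.
RESIZE_TOKEN = "<resize_image>"

def visionthink_infer(model, image: Image.Image, question: str, scale: float = 0.5):
    """Two-pass inference sketch: answer from a downsampled image first,
    and only re-run on the full-resolution image if the model asks for it."""
    # First pass: roughly 1/4 of the visual tokens (half width, half height).
    small = image.resize((int(image.width * scale), int(image.height * scale)))
    answer = model.generate(image=small, prompt=question)

    # Second pass: the model judged the downsampled view insufficient
    # (e.g., for fine-grained OCR) and requested the original resolution.
    if RESIZE_TOKEN in answer:
        answer = model.generate(image=image, prompt=question)
    return answer
```

In this sketch, simple questions are answered from the downsampled image alone, so most samples incur only the reduced visual-token cost, while OCR-style questions trigger the second, full-resolution pass.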

Cite

Text

Yang et al. "VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning." Advances in Neural Information Processing Systems, 2025.

Markdown

[Yang et al. "VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/yang2025neurips-visionthink/)

BibTeX

@inproceedings{yang2025neurips-visionthink,
  title     = {{VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning}},
  author    = {Yang, Senqiao and Li, Junyi and Lai, Xin and Wu, Jinming and Li, Wei and Ma, Zejun and Yu, Bei and Zhao, Hengshuang and Jia, Jiaya},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/yang2025neurips-visionthink/}
}