VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning

Run Luo, Renke Shan, Longze Chen, Ziqiang Liu, Lu Wang, Min Yang, Xiaobo Xia

NeurIPS 2025

Abstract

Large vision-language models (LVLMs) have emerged as foundational tools for real-world AI applications. Despite their remarkable capabilities, current LVLMs process entire images at the token level, which is far less efficient than human cognition, which selectively focuses on high-level vision concepts. This token-level redundancy becomes increasingly problematic for high-resolution images and long video sequences, resulting in large computational costs and limited scalability in practical applications. To address this limitation, we introduce the concept of a vision concept model, a novel paradigm that enables LVLMs to dynamically extract the most relevant vision concepts from complex inputs, based on task-specific instructions. To optimize this vision concept modeling process, we propose VCM, a self-supervised framework that leverages vision-language correlations across diverse instances. VCM is designed to learn meaningful vision concepts without the need for expensive concept-level annotations. At its core, it employs a forward-backward optimization algorithm that enables LVLMs to adjust concept granularity and spatial alignment dynamically. Experiments demonstrate that VCM substantially reduces computational costs (e.g., achieving up to 85% fewer FLOPs for LLaVA-1.5-7B), while maintaining strong performance across a series of vision-language tasks. The codebase is available at https://github.com/RainBowLuoCS/VCM.
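As a rough illustration of why keeping fewer vision tokens lowers compute, the sketch below estimates the prefill cost of a decoder-only language model as a function of how many vision tokens it receives. It is not the VCM method and not the paper's FLOPs accounting. The per-layer cost formula is a common back-of-the-envelope approximation (it counts multiply-accumulates and ignores the gated FFN branch), and the 64-token text prompt and all function names are assumptions made for illustration. The defaults (576 vision tokens, hidden size 4096, FFN width 11008, 32 layers) follow the public LLaVA-1.5-7B configuration.

```python
def layer_cost(n: int, d: int = 4096, m: int = 11008) -> int:
    """Approximate multiply-accumulates for one decoder layer over n tokens.

    Assumed estimate: projections + attention + a two-matmul FFN.
    """
    attn_proj = 4 * n * d * d      # Q, K, V and output projections
    attn_scores = 2 * n * n * d    # QK^T plus attention-weighted sum of V
    ffn = 2 * n * d * m            # up and down projections (gate ignored)
    return attn_proj + attn_scores + ffn


def prefill_cost(n_vision: int, n_text: int = 64, layers: int = 32) -> int:
    """Cost of one forward pass over vision and text tokens together."""
    return layers * layer_cost(n_vision + n_text)


def cost_reduction(kept_vision: int, full_vision: int = 576, n_text: int = 64) -> float:
    """Fraction of prefill cost saved by keeping only `kept_vision` tokens."""
    full = prefill_cost(full_vision, n_text)
    kept = prefill_cost(kept_vision, n_text)
    return 1.0 - kept / full


if __name__ == "__main__":
    # Hypothetical keep budgets, not values taken from the paper.
    for kept in (576, 288, 144, 72):
        print(f"{kept:4d} vision tokens kept -> {cost_reduction(kept):.1%} cost saved")
```

Because the projection and FFN terms scale linearly with sequence length and dominate at these sizes, the saving tracks the fraction of tokens removed from the whole sequence, text included. The text prompt therefore sets a floor on the remaining cost.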

Cite

Text

Luo et al. "VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning." Advances in Neural Information Processing Systems, 2025.

Markdown

[Luo et al. "VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/luo2025neurips-vcm/)

BibTeX

@inproceedings{luo2025neurips-vcm,
  title     = {{VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning}},
  author    = {Luo, Run and Shan, Renke and Chen, Longze and Liu, Ziqiang and Wang, Lu and Yang, Min and Xia, Xiaobo},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/luo2025neurips-vcm/}
}