VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning
Abstract
Large vision-language models (LVLMs) have emerged as foundational tools for real-world AI applications. Despite their remarkable capabilities, current LVLMs process entire images at the token level, leading to significant inefficiencies compared to human cognition, which selectively focuses on high-level vision concepts. This token-level redundancy becomes increasingly problematic for high-resolution images and long video sequences, resulting in high computational costs and limited scalability in practical applications. To address this limitation, we introduce the concept of a vision concept model, a novel paradigm that enables LVLMs to dynamically extract the most relevant vision concepts from complex inputs, based on task-specific instructions. To optimize this vision concept modeling process, we propose VCM, a self-supervised framework that leverages vision-language correlations across diverse instances. VCM is designed to learn meaningful vision concepts without the need for expensive concept-level annotations. At its core, it employs a forward-backward optimization algorithm that allows LVLMs to adjust concept granularity and spatial alignment dynamically. Experiments demonstrate that VCM substantially reduces computational costs (e.g., achieving up to 85% fewer FLOPs for LLaVA-1.5-7B), while maintaining strong performance across a series of vision-language tasks. The codebase is available at https://github.com/RainBowLuoCS/VCM.
Cite
Text
Luo et al. "VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning." Advances in Neural Information Processing Systems, 2025.
Markdown
[Luo et al. "VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/luo2025neurips-vcm/)
BibTeX
@inproceedings{luo2025neurips-vcm,
title = {{VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning}},
author = {Luo, Run and Shan, Renke and Chen, Longze and Liu, Ziqiang and Wang, Lu and Yang, Min and Xia, Xiaobo},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/luo2025neurips-vcm/}
}