Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping

Abstract

Transformer-based models have driven significant advancements in Multimodal Large Language Models (MLLMs), yet their computational costs surge drastically when scaling resolution, training data, and model parameters. A key bottleneck stems from the proliferation of visual tokens required for fine-grained image understanding. We propose Skip-Vision, a unified framework addressing both training and inference inefficiencies in vision-language models. On top of conventional token compression approaches, our method introduces two complementary acceleration strategies. For training acceleration, we observe that Feed-Forward Network (FFN) computations on visual tokens induce marginal feature updates. This motivates our Skip-FFN strategy, which bypasses FFN layers for redundant visual tokens. For inference acceleration, we design a selective KV-cache removal mechanism that prunes the skipped key-value pairs during decoding while preserving model performance. Experimental results demonstrate that Skip-Vision reduces training time by up to 35%, inference FLOPs by 75%, and latency by 45%, while achieving comparable or superior performance to existing methods. Our work provides a practical solution for scaling high-performance MLLMs with enhanced efficiency.
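The two strategies named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The abstract does not say how redundant visual tokens are selected or at which layers skipping applies, so the `skip_mask` argument, the `toy_ffn` function, and the helper names `skip_ffn` and `prune_kv_cache` are all hypothetical stand-ins. The sketch shows only the general idea: flagged tokens pass through the FFN sublayer unchanged, and their key-value entries are dropped from the cache before decoding.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def skip_ffn(
    hidden: Sequence[Vector],
    skip_mask: Sequence[bool],
    ffn: Callable[[Vector], Vector],
) -> List[Vector]:
    """Apply a residual FFN update only to tokens that are not flagged as skipped."""
    out: List[Vector] = []
    for h, skip in zip(hidden, skip_mask):
        if skip:
            # Skipped token: bypass the FFN, keep the hidden state as-is.
            out.append(list(h))
        else:
            update = ffn(h)
            out.append([a + b for a, b in zip(h, update)])
    return out


def prune_kv_cache(
    keys: Sequence[Vector],
    values: Sequence[Vector],
    skip_mask: Sequence[bool],
) -> Tuple[List[Vector], List[Vector]]:
    """Drop cached key-value pairs that belong to skipped tokens."""
    kept = [i for i, skip in enumerate(skip_mask) if not skip]
    return [keys[i] for i in kept], [values[i] for i in kept]


def toy_ffn(h: Vector) -> Vector:
    # Hypothetical stand-in for a real feed-forward network.
    return [0.5 * x for x in h]


# Three tokens; the middle one is assumed to be a redundant visual token.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
skip_mask = [False, True, False]

out = skip_ffn(hidden, skip_mask, toy_ffn)
kept_k, kept_v = prune_kv_cache(hidden, hidden, skip_mask)
```

In this toy setting the FFN cost scales with the number of unskipped tokens, and the decoding-time cache shrinks by the number of skipped ones. How the savings reported in the abstract arise in the actual model depends on details that the abstract does not give.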

Cite

Text

Zeng et al. "Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping." International Conference on Computer Vision, 2025.

Markdown

[Zeng et al. "Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/zeng2025iccv-skipvision/)

BibTeX

@inproceedings{zeng2025iccv-skipvision,
  title     = {{Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping}},
  author    = {Zeng, Weili and Huang, Ziyuan and Ji, Kaixiang and Yan, Yichao},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {21384--21397},
  url       = {https://mlanthology.org/iccv/2025/zeng2025iccv-skipvision/}
}