What Kind of Visual Tokens Do We Need? Training-Free Visual Token Pruning for Multi-Modal Large Language Models from the Perspective of Graph

Jiang, Yutao; Wu, Qiong; Lin, Wenhao; Yu, Wei; Zhou, Yiyi

doi:10.1609/AAAI.V39I4.32427

AAAI 2025 pp. 4075-4083

Abstract

Recent Multimodal Large Language Models (MLLMs) often use a large number of visual tokens to compensate for their visual shortcomings, leading to excessive computation and obvious visual redundancy. In this paper, we investigate what kind of visual tokens are needed for MLLMs, and reveal that both foreground and background tokens are critical for MLLMs given the varying difficulties of examples. Based on this observation, we propose a graph-based method for training-free visual token pruning, termed G-Prune. In particular, G-Prune regards visual tokens as nodes and constructs their connections based on their semantic similarities. Afterwards, information is propagated via the weighted links, and the most important tokens after several iterations are kept for the MLLM; these can be foreground or background tokens. To validate G-Prune, we apply it to a recent MLLM called LLaVA-NeXT and conduct extensive experiments on a set of benchmarks. The experimental results show that G-Prune can greatly reduce computation overhead while retaining high performance on both coarse- and fine-grained tasks. For instance, G-Prune can reduce the FLOPs of LLaVA-NeXT by 63.57% on VQA2.0 and TextVQA with only 0.95% and 2.34% accuracy drops, respectively.
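The pipeline outlined in the abstract (tokens as nodes, similarity-weighted links, iterative propagation, top-ranked tokens kept) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `graph_prune`, the use of cosine similarity, the `threshold` cut-off, the column normalisation, the uniform initial scores, and the iteration count are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import numpy as np


def graph_prune(tokens, keep, iters=5, threshold=0.0):
    """Illustrative sketch (hypothetical, not G-Prune itself).

    tokens: (N, D) array of visual token features.
    keep:   number of tokens to retain.
    Returns the sorted indices of the retained tokens.
    """
    # Assumption: semantic similarity is measured by cosine similarity.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T

    # Assumption: only similarities above `threshold` become weighted links,
    # and self-loops are dropped.
    adj = np.where(sim > threshold, sim, 0.0)
    np.fill_diagonal(adj, 0.0)

    # Assumption: each node spreads its score over its neighbours in
    # proportion to the link weights (column-normalised adjacency).
    col_sums = adj.sum(axis=0, keepdims=True)
    trans = adj / np.clip(col_sums, 1e-12, None)

    # Start from uniform importance and propagate for a few iterations.
    score = np.full(len(tokens), 1.0 / len(tokens))
    for _ in range(iters):
        score = trans @ score

    # Keep the highest-scoring tokens, returned in their original order.
    top = np.argsort(-score)[:keep]
    return np.sort(top)
```

In this sketch a token's final score depends only on how it is linked to other tokens, not on where it sits in the image, so the retained set may include both foreground and background tokens, which matches the behaviour the abstract describes.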

Cite

Text

Jiang et al. "What Kind of Visual Tokens Do We Need? Training-Free Visual Token Pruning for Multi-Modal Large Language Models from the Perspective of Graph." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I4.32427

Markdown

[Jiang et al. "What Kind of Visual Tokens Do We Need? Training-Free Visual Token Pruning for Multi-Modal Large Language Models from the Perspective of Graph." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/jiang2025aaai-kind/) doi:10.1609/AAAI.V39I4.32427

BibTeX

@inproceedings{jiang2025aaai-kind,
  title     = {{What Kind of Visual Tokens Do We Need? Training-Free Visual Token Pruning for Multi-Modal Large Language Models from the Perspective of Graph}},
  author    = {Jiang, Yutao and Wu, Qiong and Lin, Wenhao and Yu, Wei and Zhou, Yiyi},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {4075--4083},
  doi       = {10.1609/AAAI.V39I4.32427},
  url       = {https://mlanthology.org/aaai/2025/jiang2025aaai-kind/}
}