SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference

Abstract

Vision-language models (VLMs) have received increasing attention for their ability to integrate visual and textual understanding, and some can process native-resolution images and long videos. While the capacity to process large visual inputs unlocks numerous downstream applications, it often introduces significant latency, because visual tokens dominate resource consumption. In this work, we introduce SparseVILA, a query-aware token retrieval method that accelerates the underlying LLM by pruning tokens in the prefill stage and attending to only a sparse subset of visual tokens during decoding. By decoupling context compression from generation compression, we can shift most of the sparsity into the generation stage, which enables query-aware support for multi-turn conversation while achieving a 1.4x speedup on image benchmarks. This approach yields a +5.9% average accuracy improvement on image-centric benchmarks over previous work. Finally, SparseVILA makes long-context and long-generation tasks more efficient, achieving 3.6x and 1.7x speedups in prefill and decoding, respectively.
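The decoupling described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the saliency scores, the dot-product retrieval rule, and the use of keys as values are all assumptions made for illustration. It shows only the general shape of the idea: a mild, query-agnostic pruning pass at prefill, followed by a more aggressive, query-aware selection at each decoding turn over the same cache.

```python
import math

def top_k_indices(scores, k):
    """Return the indices of the k largest scores, in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def prune_prefill(tokens, saliency, keep_ratio):
    """Prefill (hypothetical rule): drop low-saliency tokens once, independent of any query."""
    k = max(1, round(len(tokens) * keep_ratio))
    return [tokens[i] for i in top_k_indices(saliency, k)]

def retrieve_for_query(cache, query, k):
    """Decoding (hypothetical rule): pick the k cached tokens with the highest dot product."""
    scores = [sum(q * x for q, x in zip(query, tok)) for tok in cache]
    return top_k_indices(scores, k)

def sparse_attend(cache, query, indices):
    """Softmax attention restricted to the retrieved subset (keys double as values here)."""
    logits = [sum(q * x for q, x in zip(query, cache[i])) for i in indices]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    return [
        sum(w * cache[i][d] for w, i in zip(weights, indices)) / total
        for d in range(len(query))
    ]

# Toy data: six 2-d "visual tokens" with made-up saliency scores.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9], [0.5, 0.5], [0.0, 0.0]]
saliency = [0.9, 0.8, 0.7, 0.6, 0.5, 0.1]

cache = prune_prefill(tokens, saliency, keep_ratio=0.8)  # mild, query-agnostic
turn_1 = retrieve_for_query(cache, [1.0, 0.0], k=2)      # aggressive, query-aware
turn_2 = retrieve_for_query(cache, [0.0, 1.0], k=2)      # a new turn reuses the same cache
output_1 = sparse_attend(cache, [1.0, 0.0], turn_1)
```

Because the prefill pass in this sketch does not depend on the query, a second conversational turn can select a different subset from the same cache, which is the property the abstract associates with multi-turn support.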

Cite

Text

Khaki et al. "SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference." International Conference on Computer Vision, 2025.

Markdown

[Khaki et al. "SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/khaki2025iccv-sparsevila/)

BibTeX

@inproceedings{khaki2025iccv-sparsevila,
  title     = {{SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference}},
  author    = {Khaki, Samir and Guo, Junxian and Tang, Jiaming and Yang, Shang and Chen, Yukang and Plataniotis, Konstantinos N. and Lu, Yao and Han, Song and Liu, Zhijian},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {23784--23794},
  url       = {https://mlanthology.org/iccv/2025/khaki2025iccv-sparsevila/}
}