Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG

Abstract

High-resolution (HR) image perception remains a key challenge for multimodal large language models (MLLMs). To move beyond the limits of heuristic methods, this paper advances the HR perception capabilities of MLLMs by drawing on long-context techniques such as retrieval-augmented generation (RAG), and presents the first study of RAG for HR perception. Specifically, we propose Retrieval-Augmented Perception (RAP), a training-free framework that retrieves and fuses relevant image crops while preserving spatial context through the proposed Spatial-Awareness Layout. To accommodate different tasks, the proposed Retrieved-Exploration Search (RE-Search) dynamically selects the optimal number of crops based on model confidence and retrieval scores. Experiments on HR benchmarks demonstrate the effectiveness of RAP, with LLaVA-v1.5-13B achieving a 43% improvement on $V^*$ Bench and 19% on HR-Bench. Code is available at https://github.com/DreamMr/RAP.
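
As a rough illustration of the idea of retrieving crops while keeping their spatial context, the sketch below picks the top-K crops from a grid of retrieval scores and then drops grid rows and columns that contain no selected crop, so the kept crops retain their relative positions. This is one plausible reading of the abstract, not the authors' implementation: the function names `select_top_k` and `compact_layout`, the fixed `k`, and the score values are all assumptions made for illustration. In RAP the number of crops would be chosen by RE-Search rather than fixed.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (row, col) position of a crop in the image grid


def select_top_k(scores: List[List[float]], k: int) -> List[Cell]:
    """Return the grid positions of the k highest-scoring crops (hypothetical helper)."""
    cells = [(s, r, c) for r, row in enumerate(scores) for c, s in enumerate(row)]
    # Sort by descending score; break ties by position for determinism.
    cells.sort(key=lambda t: (-t[0], t[1], t[2]))
    return sorted((r, c) for _, r, c in cells[:k])


def compact_layout(selected: List[Cell]) -> Dict[Cell, Cell]:
    """Map each selected crop to a position in a smaller grid.

    Rows and columns with no selected crop are removed, while the
    relative ordering (above/below, left/right) of kept crops is preserved.
    """
    rows = sorted({r for r, _ in selected})
    cols = sorted({c for _, c in selected})
    row_map = {r: i for i, r in enumerate(rows)}
    col_map = {c: i for i, c in enumerate(cols)}
    return {(r, c): (row_map[r], col_map[c]) for r, c in selected}


# Made-up retrieval scores for a 3x4 grid of crops (assumption for illustration).
scores = [
    [0.1, 0.9, 0.2, 0.1],
    [0.0, 0.1, 0.1, 0.0],
    [0.3, 0.1, 0.1, 0.8],
]
selected = select_top_k(scores, k=3)
layout = compact_layout(selected)
print(selected)
print(layout)
```

Here the middle row and the third column hold no selected crop, so they are dropped and the three kept crops fit into a 2x3 grid with their left/right and above/below relations unchanged.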

Cite

Text

Wang et al. "Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Wang et al. "Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/wang2025icml-retrievalaugmented/)

BibTeX

@inproceedings{wang2025icml-retrievalaugmented,
  title     = {{Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG}},
  author    = {Wang, Wenbin and Jing, Yongcheng and Ding, Liang and Wang, Yingjie and Shen, Li and Luo, Yong and Du, Bo and Tao, Dacheng},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {63290--63307},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/wang2025icml-retrievalaugmented/}
}