Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG
Abstract
High-resolution (HR) image perception remains a key challenge in multimodal large language models (MLLMs). To move beyond the limits of heuristic methods, this paper advances the HR perception capabilities of MLLMs by harnessing long-context techniques such as retrieval-augmented generation (RAG), and presents the first study exploring the use of RAG to address HR perception challenges. Specifically, we propose Retrieval-Augmented Perception (RAP), a training-free framework that retrieves and fuses relevant image crops while preserving spatial context using the proposed Spatial-Awareness Layout. To accommodate different tasks, the proposed Retrieved-Exploration Search (RE-Search) dynamically selects the optimal number of crops based on model confidence and retrieval scores. Experimental results on HR benchmarks demonstrate the effectiveness of RAP, with LLaVA-v1.5-13B achieving a 43% improvement on $V^*$ Bench and 19% on HR-Bench. Code is available at https://github.com/DreamMr/RAP.
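The abstract's idea of retrieving crops while preserving their spatial context can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it assumes the image has already been split into a grid of crops, that each crop has a retrieval score against the query, and that the layout step amounts to dropping grid rows and columns containing no retrieved crop so the kept crops retain their relative positions. The names `select_top_k` and `compact_layout` are hypothetical, and the number of crops `k` is fixed here, whereas RE-Search is described as choosing it dynamically.

```python
from typing import List, Optional, Tuple

Cell = Tuple[int, int]  # (row, col) position of a crop in the original grid


def select_top_k(scores: List[List[float]], k: int) -> List[Cell]:
    """Return the grid positions of the k highest-scoring crops, in row-major order."""
    cells = [(s, r, c) for r, row in enumerate(scores) for c, s in enumerate(row)]
    # Sort by descending score; break ties by position so the result is deterministic.
    cells.sort(key=lambda t: (-t[0], t[1], t[2]))
    return sorted((r, c) for _, r, c in cells[:k])


def compact_layout(selected: List[Cell]) -> List[List[Optional[Cell]]]:
    """Drop rows and columns with no selected crop, keeping relative positions.

    Assumption: this is one plausible reading of a spatially aware layout,
    not the paper's exact procedure. Empty slots are marked with None.
    """
    rows = sorted({r for r, _ in selected})
    cols = sorted({c for _, c in selected})
    row_map = {r: i for i, r in enumerate(rows)}
    col_map = {c: j for j, c in enumerate(cols)}
    layout: List[List[Optional[Cell]]] = [[None] * len(cols) for _ in rows]
    for r, c in selected:
        layout[row_map[r]][col_map[c]] = (r, c)
    return layout


# Hypothetical retrieval scores for a 3x3 grid of crops.
scores = [
    [0.1, 0.9, 0.2],
    [0.3, 0.1, 0.1],
    [0.2, 0.8, 0.7],
]
picked = select_top_k(scores, k=3)
layout = compact_layout(picked)
for row in layout:
    print(row)
```

In this toy grid the middle row and the first column contain no retrieved crop, so they are dropped, while the crop from the top row still sits above the crops from the bottom row in the compacted layout.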
Cite
Text
Wang et al. "Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG." Proceedings of the 42nd International Conference on Machine Learning, 2025.
Markdown
[Wang et al. "Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/wang2025icml-retrievalaugmented/)
BibTeX
@inproceedings{wang2025icml-retrievalaugmented,
title = {{Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG}},
author = {Wang, Wenbin and Jing, Yongcheng and Ding, Liang and Wang, Yingjie and Shen, Li and Luo, Yong and Du, Bo and Tao, Dacheng},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {63290--63307},
volume = {267},
url = {https://mlanthology.org/icml/2025/wang2025icml-retrievalaugmented/}
}