SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding
Abstract
Multimodal large language models (MLLMs) have recently shown great progress in text-rich image understanding, yet they still struggle with complex, multi-page, visually-rich documents. Traditional methods that use document parsers for retrieval-augmented generation suffer from performance and efficiency limitations, while directly presenting all pages to MLLMs leads to inefficiencies, especially with lengthy documents. In this work, we present a novel framework named **S**elf-**V**isual **R**etrieval-**A**ugmented **G**eneration (SV-RAG), which broadens the horizons of *any* MLLM to support long-document understanding. We demonstrate that **MLLMs themselves can serve as effective multimodal retrievers**, fetching relevant pages and then answering user questions based on those pages. SV-RAG is implemented with two specific MLLM adapters, one for evidence page retrieval and the other for question answering. Empirical results show state-of-the-art performance on public benchmarks, demonstrating the effectiveness of SV-RAG.
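To make the retrieve-then-answer flow concrete, below is a minimal sketch of the two-adapter pipeline described in the abstract. It is illustrative only, not the authors' implementation: `embed_with_retrieval_adapter`, `embed_query`, and `answer_with_qa_adapter` are hypothetical placeholders standing in for an MLLM equipped with a retrieval LoRA adapter and a question-answering LoRA adapter, respectively.

```python
# Illustrative sketch of an SV-RAG-style retrieve-then-answer pipeline (assumptions,
# not the paper's code). Placeholder functions stand in for an MLLM with two adapters.
import numpy as np


def embed_with_retrieval_adapter(page_image: np.ndarray) -> np.ndarray:
    """Placeholder: MLLM + retrieval adapter maps a page image to an embedding."""
    rng = np.random.default_rng(abs(hash(page_image.tobytes())) % (2**32))
    return rng.standard_normal(768)


def embed_query(question: str) -> np.ndarray:
    """Placeholder: the same retrieval adapter embeds the user question."""
    rng = np.random.default_rng(abs(hash(question)) % (2**32))
    return rng.standard_normal(768)


def answer_with_qa_adapter(question: str, pages: list[np.ndarray]) -> str:
    """Placeholder: MLLM + question-answering adapter answers from retrieved pages."""
    return f"answer to {question!r} derived from {len(pages)} retrieved page(s)"


def sv_rag_answer(question: str, page_images: list[np.ndarray], top_k: int = 3) -> str:
    # 1) Score every page against the question using the retrieval adapter.
    q = embed_query(question)
    scores = []
    for img in page_images:
        p = embed_with_retrieval_adapter(img)
        scores.append(float(q @ p / (np.linalg.norm(q) * np.linalg.norm(p) + 1e-8)))
    # 2) Keep only the top-k evidence pages instead of feeding the whole document.
    top = np.argsort(scores)[::-1][:top_k]
    evidence = [page_images[i] for i in top]
    # 3) Answer the question conditioned on the retrieved evidence pages.
    return answer_with_qa_adapter(question, evidence)


if __name__ == "__main__":
    # Toy document: ten random "page images".
    doc = [np.random.rand(224, 224, 3).astype(np.float32) for _ in range(10)]
    print(sv_rag_answer("What is the reported total revenue?", doc))
```

The key design point reflected here is that one base MLLM serves both roles, switched by lightweight adapters, so only a handful of retrieved pages reach the generation step rather than the entire document.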
Cite
Text
Chen et al. "SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding." International Conference on Learning Representations, 2025.
Markdown
[Chen et al. "SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/chen2025iclr-svrag/)
BibTeX
@inproceedings{chen2025iclr-svrag,
title = {{SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding}},
author = {Chen, Jian and Zhang, Ruiyi and Zhou, Yufan and Yu, Tong and Dernoncourt, Franck and Gu, Jiuxiang and Rossi, Ryan A. and Chen, Changyou and Sun, Tong},
booktitle = {International Conference on Learning Representations},
year = {2025},
url = {https://mlanthology.org/iclr/2025/chen2025iclr-svrag/}
}