VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
Abstract
We aim to develop a retrieval-augmented generation (RAG) framework that answers questions over a corpus of visually-rich documents presented in mixed modalities (e.g., charts, tables) and diverse formats (e.g., PDF, PPTX). In this paper, we introduce a new RAG framework, VDocRAG, which can directly understand varied documents and modalities in a unified image format, avoiding the information loss incurred when documents are parsed into text. To improve performance, we propose novel self-supervised pre-training tasks that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents. Furthermore, we introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats. OpenDocVQA provides a comprehensive resource for training and evaluating retrieval and question answering models on visually-rich documents in an open-domain setting. Experiments show that VDocRAG substantially outperforms conventional text-based RAG and has strong generalization capability, highlighting the potential of an effective RAG paradigm for real-world documents.
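To make the retrieve-then-generate flow described in the abstract concrete, below is a minimal, hypothetical sketch of a VDocRAG-style pipeline that treats every document page as an image: pages are embedded once into dense vectors, a question is matched against them by cosine similarity, and the top-ranked page images are handed to a generator. The functions `embed_page_image`, `embed_query`, and `generate_answer` are illustrative placeholders, not the authors' actual models, pre-training tasks, or APIs.

```python
# Hypothetical sketch of a retrieve-then-generate pipeline over document page
# images, in the spirit of VDocRAG. Encoder/generator bodies are placeholders.
import numpy as np


def embed_page_image(page_image: np.ndarray) -> np.ndarray:
    """Placeholder vision-language encoder: compress a rendered document page
    (chart, table, slide, etc.) into a single dense vector."""
    flat = page_image.astype(np.float32).ravel()
    vec = np.resize(flat, 512)  # stand-in for learned dense token compression
    return vec / (np.linalg.norm(vec) + 1e-8)


def embed_query(question: str) -> np.ndarray:
    """Placeholder text-side encoder, assumed to share the embedding space
    with the page-image encoder."""
    rng = np.random.default_rng(abs(hash(question)) % (2**32))
    vec = rng.standard_normal(512)
    return vec / np.linalg.norm(vec)


def retrieve(question: str, page_index: list[tuple[str, np.ndarray]], k: int = 3):
    """Rank indexed page images by cosine similarity to the question."""
    q = embed_query(question)
    scored = [(page_id, float(q @ emb)) for page_id, emb in page_index]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]


def generate_answer(question: str, retrieved_page_ids: list[str]) -> str:
    """Placeholder generator that would read the retrieved page images
    directly (no OCR/text parsing) and produce an answer."""
    return f"[answer conditioned on pages {retrieved_page_ids} for: {question}]"


if __name__ == "__main__":
    # Offline indexing: render each document page to an image and embed it once.
    pages = {f"doc1_page{i}": np.random.rand(64, 64, 3) for i in range(5)}
    index = [(pid, embed_page_image(img)) for pid, img in pages.items()]

    # Online question answering over the image corpus.
    question = "What value does the 2023 revenue chart report?"
    top_pages = retrieve(question, index, k=2)
    print(generate_answer(question, [pid for pid, _ in top_pages]))
```

The key design point the sketch mirrors is that retrieval and generation both operate on page images rather than on text extracted by a parser, which is what lets mixed-modality content such as charts and tables survive intact.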
Cite
Text
Tanaka et al. "VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02312
Markdown
[Tanaka et al. "VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/tanaka2025cvpr-vdocrag/) doi:10.1109/CVPR52734.2025.02312
BibTeX
@inproceedings{tanaka2025cvpr-vdocrag,
title = {{VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents}},
author = {Tanaka, Ryota and Iki, Taichi and Hasegawa, Taku and Nishida, Kyosuke and Saito, Kuniko and Suzuki, Jun},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {24827--24837},
doi = {10.1109/CVPR52734.2025.02312},
url = {https://mlanthology.org/cvpr/2025/tanaka2025cvpr-vdocrag/}
}