VisR-Bench: A Visual Retrieval Benchmark for Visually-Rich Documents
Abstract
Retrieval is essential for multimodal large language models (MLLMs) to handle long contexts and improve factual accuracy. However, existing benchmarks focus on end-to-end answer generation, making it difficult to evaluate retrieval in isolation. To address this, we introduce VisR-Bench, a benchmark for question-driven retrieval in scanned documents. Our queries do not explicitly contain answers, preventing models from relying on keyword matching. They also avoid ambiguous references to figures or tables by ensuring that each query includes the descriptive information needed to locate the correct content. The dataset spans English and 15 other languages, with English queries enabling fine-grained evaluation across answer modalities (tables, text, figures) and non-English queries focusing on multilingual generalization. VisR-Bench provides a comprehensive framework for evaluating retrieval in document understanding.
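As a concrete illustration of the retrieval evaluation the benchmark targets, the sketch below computes page-level recall@k from a retriever's similarity scores. The data layout (a score matrix over candidate pages and a gold page index per query) is assumed for illustration only; it is not the benchmark's official format or toolkit.

import numpy as np

def recall_at_k(scores: np.ndarray, gold_pages: np.ndarray, k: int = 5) -> float:
    """Fraction of queries whose gold page appears among the top-k retrieved pages.

    scores:     (num_queries, num_pages) similarity matrix from a retriever.
    gold_pages: (num_queries,) index of the correct page for each query.
    """
    # Rank pages per query by descending similarity and keep the top k.
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = (topk == gold_pages[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 queries over 4 candidate pages (hypothetical scores).
scores = np.array([
    [0.9, 0.1, 0.3, 0.2],   # gold page 0 ranked first
    [0.2, 0.4, 0.8, 0.1],   # gold page 1 ranked second
    [0.5, 0.6, 0.7, 0.4],   # gold page 3 ranked last
])
gold = np.array([0, 1, 3])
print(recall_at_k(scores, gold, k=2))  # 2 of 3 gold pages in top-2 -> ~0.67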
Cite
Text
Chen et al. "VisR-Bench: A Visual Retrieval Benchmark for Visually-Rich Documents." ICLR 2025 Workshops: FM-Wild, 2025.
Markdown
[Chen et al. "VisR-Bench: A Visual Retrieval Benchmark for Visually-Rich Documents." ICLR 2025 Workshops: FM-Wild, 2025.](https://mlanthology.org/iclrw/2025/chen2025iclrw-visrbench/)
BibTeX
@inproceedings{chen2025iclrw-visrbench,
title = {{VisR-Bench: A Visual Retrieval Benchmark for Visually-Rich Documents}},
author = {Chen, Jian and Zhang, Ruiyi and Li, Ming and Zhou, Shijie and Chen, Changyou},
booktitle = {ICLR 2025 Workshops: FM-Wild},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/chen2025iclrw-visrbench/}
}