VisR-Bench: A Visual Retrieval Benchmark for Visually-Rich Documents

Abstract

Retrieval is essential for multimodal large language models (MLLMs) to handle long contexts and improve factual accuracy. However, existing benchmarks focus on end-to-end answer generation, making retrieval evaluation difficult. To address this, we introduce VisR-Bench, a benchmark for question-driven retrieval in scanned documents. Our queries do not explicitly contain answers, preventing models from relying on keyword matching. Additionally, they avoid ambiguous references to figures or tables by ensuring that each query includes the descriptive information necessary to locate the correct content. The dataset spans English and 15 other languages, with English queries enabling fine-grained evaluation across answer modalities (tables, text, figures) and non-English queries targeting multilingual generalization. VisR-Bench provides a comprehensive framework for evaluating retrieval in document understanding.
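
As a rough illustration of the evaluation setting the abstract describes, the sketch below scores a page retriever with recall@k, broken down by answer modality. The dataset fields (`query`, `gold_page`, `language`, `modality`) and the retriever interface are assumptions made for this example, not the benchmark's actual data format or official evaluation code.

```python
# Minimal sketch of page-level retrieval scoring on a VisR-Bench-style dataset.
# Field names and the retriever interface are hypothetical, for illustration only.
from collections import defaultdict
from typing import Callable, Dict, List


def recall_at_k(
    examples: List[Dict],                    # each: query, gold_page, language, modality
    rank_pages: Callable[[str], List[str]],  # retriever: query -> ranked page IDs
    k: int = 5,
) -> Dict[str, float]:
    """Recall@k overall and per answer modality (table, text, figure)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for ex in examples:
        top_k = rank_pages(ex["query"])[:k]
        correct = ex["gold_page"] in top_k
        for key in ("all", ex["modality"]):
            totals[key] += 1
            hits[key] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}


# Toy usage with a dummy retriever that returns a fixed ranking.
examples = [
    {"query": "Which table lists 2023 revenue?", "gold_page": "p3",
     "language": "en", "modality": "table"},
    {"query": "What trend does the energy-use bar chart show?", "gold_page": "p7",
     "language": "en", "modality": "figure"},
]
print(recall_at_k(examples, rank_pages=lambda q: ["p3", "p1", "p7"], k=5))
```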

Cite

Text

Chen et al. "VisR-Bench: A Visual Retrieval Benchmark for Visually-Rich Documents." ICLR 2025 Workshops: FM-Wild, 2025.

Markdown

[Chen et al. "VisR-Bench: A Visual Retrieval Benchmark for Visually-Rich Documents." ICLR 2025 Workshops: FM-Wild, 2025.](https://mlanthology.org/iclrw/2025/chen2025iclrw-visrbench/)

BibTeX

@inproceedings{chen2025iclrw-visrbench,
  title     = {{VisR-Bench: A Visual Retrieval Benchmark for Visually-Rich Documents}},
  author    = {Chen, Jian and Zhang, Ruiyi and Li, Ming and Zhou, Shijie and Chen, Changyou},
  booktitle = {ICLR 2025 Workshops: FM-Wild},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/chen2025iclrw-visrbench/}
}