MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs
Abstract
While multimodal large language models (MLLMs) have demonstrated extraordinary vision-language understanding capabilities, their ability to solve instance-level vision-language problems beyond a single image warrants further exploration. To assess these unproven abilities of MLLMs, this paper proposes a new visual grounding task, multi-context visual grounding, which aims to localize instances of interest across multiple images based on open-ended text prompts. To facilitate this research, we construct a new dataset, MC-Bench, featuring 2K high-quality, manually annotated samples. Each sample consists of an instance-level labeled image pair and a corresponding text prompt that indicates the target instances in the images. These text prompts are highly open-ended, follow three distinct styles, and cover 20 practical skills. We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities, along with a simple yet effective agentic baseline that we develop and a baseline finetuned via multi-context instruction tuning. Our evaluation reveals a non-trivial performance gap between existing MLLMs and humans, along with several insightful observations that suggest potential future directions. We hope that MC-Bench and our empirical findings encourage the research community to further advance the untapped potential of MLLMs in instance-level tasks, particularly in multi-image contexts. Project page: https://xuyunqiu.github.io/MC-Bench.
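To make the sample structure described in the abstract concrete, the sketch below shows one plausible Python schema for a multi-context grounding sample: an image pair, an open-ended prompt, and instance-level box annotations per image. All field names, the coordinate convention, and the example values are illustrative assumptions, not the dataset's released format.

```python
# Hypothetical schema for one MC-Bench sample (illustrative only;
# field names and coordinate convention are assumptions).
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) in pixels

@dataclass
class MCBenchSample:
    """An image pair, a text prompt, and per-image target boxes."""
    image_paths: Tuple[str, str]         # the two images forming the multi-image context
    prompt: str                          # open-ended prompt indicating the target instances
    boxes: Tuple[List[Box], List[Box]]   # instance-level labels; a list may be empty
                                         # when the prompt matches nothing in that image

# Example usage with made-up values: the prompt targets one instance in the
# first image and none in the second.
sample = MCBenchSample(
    image_paths=("left.jpg", "right.jpg"),
    prompt="the red car parked closest to the crosswalk",
    boxes=([(120.0, 64.0, 340.0, 210.0)], []),
)
print(f"{len(sample.boxes[0])} target(s) in image 1, {len(sample.boxes[1])} in image 2")
```

Note that allowing an empty box list per image matters for this task: unlike single-image grounding, a model must decide not only where the target is but also in which image(s) it appears at all.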
Cite
Text
Xu et al. "MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs." International Conference on Computer Vision, 2025.
Markdown
[Xu et al. "MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/xu2025iccv-mcbench/)
BibTeX
@inproceedings{xu2025iccv-mcbench,
title = {{MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs}},
author = {Xu, Yunqiu and Zhu, Linchao and Yang, Yi},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {17675--17687},
url = {https://mlanthology.org/iccv/2025/xu2025iccv-mcbench/}
}