Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs

Abstract

Recent advances in multimodal large language models (MLLMs) have shown strong capabilities in visual perception, reasoning, and vision-language understanding. However, the visual matching ability of MLLMs has rarely been studied, even though finding visual correspondences between objects is essential in computer vision. Our research reveals that the matching capabilities of recent MLLMs still exhibit systematic shortcomings, even in strong models such as GPT-4o. In particular, we construct the Multimodal Visual Matching (MMVM) benchmark to fairly evaluate over 30 different MLLMs. The MMVM benchmark is built from 15 open-source datasets and Internet videos with manual annotation. In addition, we design an automatic annotation pipeline to generate the MMVM SFT dataset, which contains 220K visual matching samples with reasoning annotations. To our knowledge, this is the first visual matching dataset and benchmark for the MLLM community. Finally, we present CoLVA, a novel contrastive MLLM with two technical designs: a fine-grained vision expert with object-level contrastive learning and an instruction augmentation strategy. The former learns instance-discriminative tokens, while the latter further improves instruction-following ability. CoLVA-InternVL2-4B achieves an overall accuracy (OA) of 49.80% on the MMVM benchmark, surpassing GPT-4o and the best open-source MLLM, Qwen2VL-72B, by 7.15% and 11.72% OA, respectively. These results demonstrate the effectiveness of our MMVM SFT dataset and our novel technical designs. Code, benchmark, dataset, and models will be released.
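
To illustrate the flavor of object-level contrastive learning mentioned above, here is a minimal, hypothetical sketch of an InfoNCE-style loss over per-object embeddings. The function name, tensor shapes, and temperature are assumptions for illustration only; the abstract does not specify CoLVA's actual loss or implementation.

```python
import torch
import torch.nn.functional as F

def object_contrastive_loss(obj_tokens_a, obj_tokens_b, temperature=0.07):
    """Illustrative InfoNCE-style loss over object-level tokens (not the paper's code).

    obj_tokens_a, obj_tokens_b: (N, D) embeddings of the same N objects seen in
    two images/views; row i of each tensor forms a positive pair, and all other
    rows act as negatives, encouraging instance-discriminative tokens.
    """
    a = F.normalize(obj_tokens_a, dim=-1)
    b = F.normalize(obj_tokens_b, dim=-1)
    logits = a @ b.t() / temperature                      # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # positives on the diagonal
    # Symmetric cross-entropy: match a -> b and b -> a
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```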

Cite

Text

Zhou et al. "Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs." International Conference on Computer Vision, 2025.

Markdown

[Zhou et al. "Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/zhou2025iccv-they/)

BibTeX

@inproceedings{zhou2025iccv-they,
  title     = {{Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs}},
  author    = {Zhou, Yikang and Zhang, Tao and Xu, Shilin and Chen, Shihao and Zhou, Qianyu and Tong, Yunhai and Ji, Shunping and Zhang, Jiangning and Qi, Lu and Li, Xiangtai},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {17663--17674},
  url       = {https://mlanthology.org/iccv/2025/zhou2025iccv-they/}
}