Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models

Abstract

Recent advancements in multimodal large language models have driven breakthroughs in visual question answering. Yet a critical gap persists in ‘conceptualization’: the ability to recognize and reason about the same concept despite variations in visual form, a basic ability of human reasoning. To address this challenge, we introduce the Visual Graph Arena (VGA), a dataset featuring six graph-based tasks designed to evaluate and improve AI systems’ capacity for visual abstraction. VGA uses diverse graph layouts (e.g., Kamada-Kawai vs. planar) to test reasoning independent of visual form. Experiments with state-of-the-art vision models and multimodal LLMs reveal a striking divide: humans achieved near-perfect accuracy across tasks, while models failed entirely on isomorphism detection and showed limited success on path and cycle tasks. We further identify behavioral anomalies suggesting pseudo-intelligent pattern matching rather than genuine understanding. These findings underscore fundamental limitations in current AI models for visual understanding. By isolating the challenge of representation-invariant reasoning, the VGA provides a framework to drive progress toward human-like conceptualization in AI visual models. The Visual Graph Arena is available at: https://vga.csail.mit.edu/.
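
To make the idea of representation-invariant reasoning concrete, the following is a minimal sketch, not the VGA pipeline. It assumes two toy layout functions (`circular_layout` and `grid_layout`, hypothetical stand-ins for layouts such as Kamada-Kawai or planar) and a brute-force `is_isomorphic` helper that is practical only for very small graphs. It shows that node coordinates can change freely while the answer to an isomorphism question depends only on the edge structure.

```python
import math
from itertools import permutations


def circular_layout(nodes):
    """Place nodes evenly on a unit circle (one possible visual form)."""
    n = len(nodes)
    return {
        v: (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
        for i, v in enumerate(nodes)
    }


def grid_layout(nodes, cols=3):
    """Place nodes on a grid (a different visual form of the same graph)."""
    return {v: (float(i % cols), float(i // cols)) for i, v in enumerate(nodes)}


def is_isomorphic(nodes_a, edges_a, nodes_b, edges_b):
    """Brute-force isomorphism test for small simple undirected graphs."""
    if len(nodes_a) != len(nodes_b) or len(edges_a) != len(edges_b):
        return False
    target = {frozenset(e) for e in edges_b}
    # Try every bijection from nodes_a to nodes_b and compare edge sets.
    for perm in permutations(nodes_b):
        mapping = dict(zip(nodes_a, perm))
        mapped = {frozenset((mapping[u], mapping[v])) for u, v in edges_a}
        if mapped == target:
            return True
    return False


# Illustrative graphs (assumed for this sketch, not taken from the dataset).
nodes = [0, 1, 2, 3, 4, 5]
cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]

letters = ["a", "b", "c", "d", "e", "f"]
relabeled = [("a", "c"), ("c", "e"), ("e", "b"), ("b", "d"), ("d", "f"), ("f", "a")]

two_triangles = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]

# Same graph, two different drawings: coordinates differ, edges do not.
pos_circle = circular_layout(nodes)
pos_grid = grid_layout(nodes)

same_structure = is_isomorphic(nodes, cycle, letters, relabeled)
different_structure = is_isomorphic(nodes, cycle, nodes, two_triangles)
```

Here `same_structure` is true because the relabeled edge list is still a single 6-cycle, while `different_structure` is false even though both graphs have six nodes and six edges. A model that reasons from the drawing alone has to recover this structure from `pos_circle` or `pos_grid` style renderings, which is the kind of ability the abstract describes.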

Cite

Text

Babaiee et al. "Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Babaiee et al. "Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/babaiee2025icml-visual/)

BibTeX

@inproceedings{babaiee2025icml-visual,
  title     = {{Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models}},
  author    = {Babaiee, Zahra and Kiasari, Peyman and Rus, Daniela and Grosu, Radu},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {2081-2113},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/babaiee2025icml-visual/}
}