SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models

Abstract

The increasing application of multimodal large language models (MLLMs) across various sectors has spotlighted the importance of their output reliability and accuracy, particularly their ability to produce content grounded in factual information (e.g., common and domain-specific knowledge). In this work, we introduce SimpleVQA, the first comprehensive multimodal benchmark for evaluating the factuality of MLLMs when answering short natural-language questions. SimpleVQA is characterized by 7 key features: it is bilingual, covers multiple tasks and multiple scenarios, ensures high-quality and challenging queries, maintains static and timeless reference answers, and is straightforward to evaluate. Our approach categorizes visual question-answering items into 9 tasks around objective events or common knowledge and situates these within 9 scenario domains. Rigorous quality-control processes guarantee high-quality, concise, and clear answers, enabling evaluation with minimal variance via an LLM-as-a-judge scoring system. Using SimpleVQA, we perform a comprehensive assessment of 18 leading MLLMs and 8 text-only LLMs, delving into their image comprehension and text generation abilities by identifying and analyzing error cases.
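The LLM-as-a-judge scoring mentioned above can be sketched as follows. This is a minimal illustration, not the paper's exact pipeline: it assumes a SimpleQA-style grading scheme with CORRECT / INCORRECT / NOT_ATTEMPTED labels, and the prompt wording, function names, and fallback behavior are all hypothetical.

```python
# Illustrative LLM-as-a-judge scorer for short-answer VQA.
# Assumption: the judge model returns one of three labels, as in
# SimpleQA-style factuality grading; the template below is NOT the
# paper's actual prompt.
from collections import Counter

# Parse longer labels first: "CORRECT" is a substring of "INCORRECT",
# so naive substring matching in the wrong order would misgrade.
PARSE_ORDER = ("NOT_ATTEMPTED", "INCORRECT", "CORRECT")


def build_judge_prompt(question: str, reference: str, prediction: str) -> str:
    """Format a grading prompt to send to the judge model."""
    return (
        "Grade the predicted answer against the reference answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Predicted answer: {prediction}\n"
        "Reply with exactly one label: CORRECT, INCORRECT, or NOT_ATTEMPTED."
    )


def parse_verdict(judge_reply: str) -> str:
    """Map the judge model's free-text reply to one grade label."""
    reply = judge_reply.strip().upper()
    for grade in PARSE_ORDER:
        if grade in reply:
            return grade
    return "NOT_ATTEMPTED"  # conservative fallback for unparseable replies


def aggregate(verdicts: list[str]) -> dict[str, float]:
    """Compute overall accuracy and accuracy over attempted answers."""
    counts = Counter(verdicts)
    n = len(verdicts) or 1
    attempted = counts["CORRECT"] + counts["INCORRECT"]
    return {
        "correct": counts["CORRECT"] / n,
        "correct_given_attempted": (
            counts["CORRECT"] / attempted if attempted else 0.0
        ),
    }
```

Keeping reference answers short and unambiguous, as the benchmark does, is what makes this single-label grading low-variance: the judge compares two concise strings rather than scoring open-ended prose.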

Cite

Text

Cheng et al. "SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models." International Conference on Computer Vision, 2025.

Markdown

[Cheng et al. "SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/cheng2025iccv-simplevqa/)

BibTeX

@inproceedings{cheng2025iccv-simplevqa,
  title     = {{SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models}},
  author    = {Cheng, Xianfu and Zhang, Wei and Zhang, Shiwei and Yang, Jian and Guan, Xiangyuan and Wu, Xianjie and Li, Xiang and Zhang, Ge and Liu, Jiaheng and Mai, Yuying and Zeng, Yutao and Wen, Zhoufutu and Jin, Ke and Wang, Baorui and Zhou, Weixiao and Lu, Yunhong and Ji, Hangyuan and Li, Tongliang and Huang, Wenhao and Li, Zhoujun},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {4637--4646},
  url       = {https://mlanthology.org/iccv/2025/cheng2025iccv-simplevqa/}
}