MathVerse: Does Your Multi-Modal LLM Truly See the Diagrams in Visual Math Problems?

Abstract

The remarkable progress of Multi-modal Large Language Models (MLLMs) has attracted unparalleled attention. However, their capabilities in visual math problem-solving remain insufficiently evaluated and understood. Our investigation finds that current benchmarks incorporate excessive visual content within textual questions, which potentially assists MLLMs in deducing answers without truly interpreting the input diagrams. To this end, we introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Each problem is then transformed by human annotators into six distinct versions, each offering a different degree of multi-modal information content, contributing to 15K test samples in total. This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning. In addition, we propose a Chain-of-Thought (CoT) evaluation strategy for a fine-grained assessment of the output answers. Rather than naively judging answers as true or false, we employ GPT-4(V) to adaptively assess each step with error analysis to derive a total score, which can reveal the quality of the intermediate CoT reasoning of MLLMs. With MathVerse, we find that most existing MLLMs struggle to understand math diagrams, relying heavily on textual questions. Surprisingly, some of them even achieve 5%+ higher accuracy without the visual input. Besides, GPT-4V and MAVIS-7B achieve the best overall performance among closed-source and open-source models, respectively. We hope the MathVerse benchmark may provide unique insights to guide the future development of MLLMs. Project page: https://mathverse-cuhk.github.io.

Cite

Text

Zhang et al. "MathVerse: Does Your Multi-Modal LLM Truly See the Diagrams in Visual Math Problems?" Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73242-3_10

Markdown

[Zhang et al. "MathVerse: Does Your Multi-Modal LLM Truly See the Diagrams in Visual Math Problems?" Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/zhang2024eccv-mathverse/) doi:10.1007/978-3-031-73242-3_10

BibTeX

@inproceedings{zhang2024eccv-mathverse,
  title     = {{MathVerse: Does Your Multi-Modal LLM Truly See the Diagrams in Visual Math Problems?}},
  author    = {Zhang, Renrui and Jiang, Dongzhi and Zhang, Yichi and Lin, Haokun and Guo, Ziyu and Qiu, Pengshuo and Zhou, Aojun and Lu, Pan and Chang, Kai-Wei and Gao, Peng and Li, Hongsheng},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73242-3_10},
  url       = {https://mlanthology.org/eccv/2024/zhang2024eccv-mathverse/}
}