Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks. However, their capacity to comprehend human-centric scenes has rarely been explored, primarily due to the absence of comprehensive evaluation benchmarks that account for both human-oriented granular perception and higher-dimensional causal reasoning. Building such high-quality benchmarks is difficult, given the physical complexity of the human body and the difficulty of annotating granular structures. In this paper, we propose Human-MME, a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding. Compared with existing benchmarks, our work provides three key features: **(1) Diversity of human scenes**, spanning 4 primary visual domains with 15 secondary domains and 43 sub-fields to ensure broad scenario coverage. **(2) Progressive and diverse evaluation dimensions**, evaluating human-based activities progressively from human-oriented granular perception to higher-dimensional multi-target and causal reasoning, comprising eight dimensions with 19,945 real-world image-question pairs and an evaluation suite. **(3) High-quality annotations with rich data paradigms**, built with an automated annotation pipeline and a human-annotation platform that support rigorous manual labeling by expert annotators to facilitate precise and reliable model assessment. Our benchmark extends single-person, single-image understanding to multi-person, multi-image mutual understanding by constructing choice, short-answer, grounding, ranking, and judgment question components, as well as complex question-answer pairs that combine them. Extensive experiments on 20 state-of-the-art MLLMs expose their limitations and guide future MLLM research toward better human-centric image understanding and reasoning.
Data and code are available at [https://github.com/Yuan-Hou/Human-MME](https://github.com/Yuan-Hou/Human-MME).
Cite
Text
Liu et al. "Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models." International Conference on Learning Representations, 2026.
Markdown
[Liu et al. "Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/liu2026iclr-humanmme/)
BibTeX
@inproceedings{liu2026iclr-humanmme,
title = {{Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models}},
author = {Liu, Yuansen and Tang, Haiming and Peng, Jinlong and Zhang, Jiangning and Ji, Xiaozhong and He, Qingdong and Luo, Donghao and Gan, Zhenye and Zhu, Junwei and Shen, Yunhang and Fu, Chaoyou and Wang, Chengjie and Hu, Xiaobin and Yan, Shuicheng},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/liu2026iclr-humanmme/}
}