LMM4LMM: Benchmarking and Evaluating Large-Multimodal Image Generation with LMMs

Abstract

Recent breakthroughs in large multimodal models (LMMs) have significantly advanced both text-to-image (T2I) generation and image-to-text (I2T) interpretation. However, many generated images still suffer from issues related to perceptual quality and text-image alignment. Given the high cost and inefficiency of manual evaluation, an automatic metric that aligns with human preferences is desirable. To this end, we present EvalMi-50K, a comprehensive dataset and benchmark for evaluating large-multimodal image generation, which features (i) comprehensive tasks, encompassing 2,100 extensive prompts across 20 fine-grained task dimensions, and (ii) large-scale human-preference annotations, including 100K mean-opinion scores (MOSs) and 50K question-answering (QA) pairs annotated on 50,400 images generated from 24 T2I models. Based on EvalMi-50K, we propose LMM4LMM, an LMM-based metric for evaluating large multimodal T2I generation from multiple dimensions, including perceptual quality, text-image correspondence, and task-specific accuracy. Extensive experimental results show that LMM4LMM achieves state-of-the-art performance on EvalMi-50K and exhibits strong generalization ability on other AI-generated image evaluation benchmark datasets, manifesting the generality of both the EvalMi-50K dataset and the LMM4LMM metric. Both EvalMi-50K and LMM4LMM will be released at https://github.com/IntMeGroup/LMM4LMM.
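
How well an automatic metric "aligns with human preferences" is commonly quantified in image quality assessment by the rank correlation between the metric's scores and the MOSs. The following is a minimal sketch of Spearman rank correlation (SRCC) in plain Python, offered only as an illustration; it is not the evaluation code of LMM4LMM, and the function names and sample scores are hypothetical.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson linear correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def srcc(metric_scores, mos):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(metric_scores), ranks(mos))


# Hypothetical per-image scores, for illustration only (not from EvalMi-50K).
mos = [2.1, 3.4, 4.8, 1.5, 3.9]
metric = [0.30, 0.55, 0.91, 0.12, 0.60]
print(srcc(metric, mos))
```

A value near 1 means the metric orders images the same way human raters do; a value near 0 means the two orderings are unrelated. Because only ranks matter, the metric's scale does not need to match the MOS scale.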

Cite

Text

Wang et al. "LMM4LMM: Benchmarking and Evaluating Large-Multimodal Image Generation with LMMs." International Conference on Computer Vision, 2025.

Markdown

[Wang et al. "LMM4LMM: Benchmarking and Evaluating Large-Multimodal Image Generation with LMMs." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/wang2025iccv-lmm4lmm/)

BibTeX

@inproceedings{wang2025iccv-lmm4lmm,
  title     = {{LMM4LMM: Benchmarking and Evaluating Large-Multimodal Image Generation with LMMs}},
  author    = {Wang, Jiarui and Duan, Huiyu and Zhao, Yu and Wang, Juntong and Zhai, Guangtao and Min, Xiongkuo},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {17312-17323},
  url       = {https://mlanthology.org/iccv/2025/wang2025iccv-lmm4lmm/}
}