MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models

Xie, Wulin; Zhang, YiFan; Fu, Chaoyou; Shi, Yang; Zeng, Jianshu; Nie, Bingyan; Chen, Hongkai; Zhang, Zhang; Wang, Liang

ICLR 2026

Abstract

Unified Multimodal Large Language Models (U-MLLMs) have garnered considerable interest for their ability to seamlessly integrate generation and comprehension tasks. However, existing research lacks a unified evaluation standard, often relying on isolated benchmarks to assess these capabilities. Moreover, current work highlights the potential of “mixed-modality generation capabilities” through case studies, such as generating auxiliary lines in images to solve geometric problems, or reasoning through a problem before generating a corresponding image. Despite this, there is no standardized benchmark to assess models on such unified tasks. To address this gap, we introduce MME-Unify, also termed MME-U, the first open and reproducible benchmark designed to evaluate multimodal comprehension, generation, and mixed-modality generation capabilities. For comprehension and generation tasks, we curate a diverse set of tasks from 12 datasets, aligning their formats and metrics to develop a standardized evaluation framework. For unified tasks, we design five subtasks to rigorously assess how models’ understanding and generation capabilities can mutually enhance each other. Evaluation of 17 U-MLLMs, including Janus-Pro, Bagel, and Gemini2-Flash, reveals significant room for improvement, particularly in areas such as instruction following and image generation quality.

Cite

Text

Xie et al. "MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models." International Conference on Learning Representations, 2026.

Markdown

[Xie et al. "MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/xie2026iclr-mmeunify/)

BibTeX

@inproceedings{xie2026iclr-mmeunify,
  title     = {{MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models}},
  author    = {Xie, Wulin and Zhang, YiFan and Fu, Chaoyou and Shi, Yang and Zeng, Jianshu and Nie, Bingyan and Chen, Hongkai and Zhang, Zhang and Wang, Liang},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/xie2026iclr-mmeunify/}
}