ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-Ended Real-World Tasks

Sani, Samin Mahdizadeh; Ku, Max; Jamali, Nima; Sani, Matina Mahdizadeh; Khoshtab, Paria; Sun, Wei-Chieh; Fazel, Parnian; Tam, Zhi Rui; Chong, Thomas; Chan, Edisy Kin Wai; Tsang, Donald Wai Tong; Hsu, Chiao-Wei; Wai, Lam Ting; Ng, Ho Yin Sam; Chu, Chiafeng; Mak, Chak-Wing; Wu, Keming; Wong, Hiu Tung; Ho, Yik Chun; Ruan, Chi; Li, Zhuofeng; Fang, I-Sheng; Yeh, Shih-Ying; Cheng, Ho Kei; Nie, Ping; Chen, Wenhu

ICLR 2026

Abstract

Advances in diffusion, autoregressive, and hybrid models have enabled high-quality image synthesis for tasks such as text-to-image generation, editing, and reference-guided composition. Yet existing benchmarks remain limited: they focus on isolated tasks, cover only narrow domains, or provide opaque scores without explaining failure modes. We introduce ImagenWorld, a benchmark of 3.6K condition sets spanning six core tasks (generation and editing, with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). The benchmark is supported by 20K fine-grained human annotations and an explainable evaluation schema that tags localized object-level and segment-level errors, complementing automated VLM-based metrics. Our large-scale evaluation of 14 models yields several insights: (1) models typically struggle more in editing tasks than in generation tasks, especially in local edits; (2) models excel in artistic and photorealistic settings but struggle with symbolic and text-heavy domains such as screenshots and information graphics; (3) closed-source systems lead overall, while targeted data curation (e.g., Qwen-Image) narrows the gap in text-heavy cases; and (4) modern VLM-based metrics achieve Kendall accuracies up to 0.79, approximating human ranking, but fall short of fine-grained, explainable error attribution. ImagenWorld provides both a rigorous benchmark and a diagnostic tool to advance robust image generation.
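
The abstract reports agreement between VLM-based metrics and human ranking as a Kendall accuracy but does not define it here. One common reading, assumed in the sketch below, is the fraction of item pairs that the metric orders the same way as the human scores (without ties this equals (tau + 1) / 2 for Kendall's tau). The function name, the tie handling, and the toy scores are illustrative assumptions, not the authors' evaluation code or data.

```python
from itertools import combinations


def kendall_pairwise_accuracy(human, metric):
    """Fraction of item pairs ordered the same way by both score lists.

    Assumption: pairs tied in either list are skipped rather than
    counted as errors; the paper may treat ties differently.
    """
    if len(human) != len(metric):
        raise ValueError("score lists must have the same length")
    concordant = 0
    comparable = 0
    for i, j in combinations(range(len(human)), 2):
        dh = human[i] - human[j]
        dm = metric[i] - metric[j]
        if dh == 0 or dm == 0:
            continue  # skip ties (assumed convention)
        comparable += 1
        if (dh > 0) == (dm > 0):
            concordant += 1
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy scores for four generated images (made up for illustration).
human_scores = [0.9, 0.4, 0.7, 0.2]
metric_scores = [0.8, 0.5, 0.3, 0.1]
acc = kendall_pairwise_accuracy(human_scores, metric_scores)
print(f"pairwise accuracy: {acc:.3f}")
```

In this toy case the metric disagrees with the human ordering on one of the six pairs, so five of six pairs are concordant.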

Cite

Text

Sani et al. "ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-Ended Real-World Tasks." International Conference on Learning Representations, 2026.

Markdown

[Sani et al. "ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-Ended Real-World Tasks." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/sani2026iclr-imagenworld/)

BibTeX

@inproceedings{sani2026iclr-imagenworld,
  title     = {{ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-Ended Real-World Tasks}},
  author    = {Sani, Samin Mahdizadeh and Ku, Max and Jamali, Nima and Sani, Matina Mahdizadeh and Khoshtab, Paria and Sun, Wei-Chieh and Fazel, Parnian and Tam, Zhi Rui and Chong, Thomas and Chan, Edisy Kin Wai and Tsang, Donald Wai Tong and Hsu, Chiao-Wei and Wai, Lam Ting and Ng, Ho Yin Sam and Chu, Chiafeng and Mak, Chak-Wing and Wu, Keming and Wong, Hiu Tung and Ho, Yik Chun and Ruan, Chi and Li, Zhuofeng and Fang, I-Sheng and Yeh, Shih-Ying and Cheng, Ho Kei and Nie, Ping and Chen, Wenhu},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/sani2026iclr-imagenworld/}
}