ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning

Guo, Yuxiang; Liu, Jiang; Wang, Ze; Chen, Hao; Sun, Ximeng; Zhao, Yang; Wu, Jialian; Yu, Xiaodong; Liu, Zicheng; Barsoum, Emad

ICLR 2026

Abstract

The rapid advancement of text-to-image (T2I) models has increased the need for reliable human preference modeling, a demand further amplified by recent progress in reinforcement learning for preference alignment. However, existing approaches typically quantify the quality of a generated image using a single scalar, limiting their ability to provide comprehensive and interpretable feedback on image quality. To address this, we introduce ImageDoctor, a unified multi-aspect T2I model evaluation framework that assesses image quality across four complementary dimensions: plausibility, semantic alignment, aesthetics, and overall quality. ImageDoctor also provides pixel-level flaw indicators in the form of heatmaps, which highlight misaligned or implausible regions, and can be used as a dense reward for T2I model preference alignment. Inspired by the diagnostic process, we improve the detail sensitivity and reasoning capability of ImageDoctor by introducing a "look-think-predict" paradigm, where the model first localizes potential flaws, then generates reasoning, and finally concludes the evaluation with quantitative scores. Built on top of a vision-language model and trained through a combination of supervised fine-tuning and reinforcement learning, ImageDoctor demonstrates strong alignment with human preference across multiple datasets, establishing its effectiveness as an evaluation metric. Furthermore, when used as a reward model for preference tuning, ImageDoctor significantly improves generation quality, achieving an improvement of 10% over scalar-based reward models.
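
To make the contrast between a single scalar score and a dense, heatmap-based reward concrete, the following is a minimal illustrative sketch in Python. It is not the authors' implementation: the `Diagnosis` container, the `scalar_reward` and `dense_reward` functions, the `flaw_weight` parameter, and the specific formula (overall score minus a flaw penalty) are all assumptions introduced here for illustration. The abstract states only that four aspect scores and flaw heatmaps are produced and that the heatmaps can serve as a dense reward.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Diagnosis:
    """Hypothetical container for a multi-aspect evaluation (names are illustrative)."""
    plausibility: float
    alignment: float
    aesthetics: float
    overall: float
    # Assumed convention: values in [0, 1], where 1 marks the most flawed pixels.
    flaw_heatmap: List[List[float]]


def scalar_reward(diag: Diagnosis) -> float:
    """Collapse the four aspect scores into one number (a plain average, by assumption)."""
    scores = [diag.plausibility, diag.alignment, diag.aesthetics, diag.overall]
    return sum(scores) / len(scores)


def dense_reward(diag: Diagnosis, flaw_weight: float = 1.0) -> List[List[float]]:
    """Per-pixel reward: the overall score minus a penalty proportional to flaw intensity.

    This particular formula is an assumption chosen for simplicity.
    """
    return [
        [diag.overall - flaw_weight * h for h in row]
        for row in diag.flaw_heatmap
    ]


if __name__ == "__main__":
    diag = Diagnosis(
        plausibility=0.8,
        alignment=0.6,
        aesthetics=0.7,
        overall=0.5,
        flaw_heatmap=[[0.0, 0.5], [1.0, 0.0]],
    )
    print(scalar_reward(diag))  # one number for the whole image
    print(dense_reward(diag))   # one number per pixel
```

Under these assumptions, the scalar reward assigns the same value to every region of the image, whereas the dense reward is lower exactly where the heatmap flags a flaw, which is the kind of localized signal the abstract describes.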

Cite

Text

Guo et al. "ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning." International Conference on Learning Representations, 2026.

Markdown

[Guo et al. "ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/guo2026iclr-imagedoctor/)

BibTeX

@inproceedings{guo2026iclr-imagedoctor,
  title     = {{ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning}},
  author    = {Guo, Yuxiang and Liu, Jiang and Wang, Ze and Chen, Hao and Sun, Ximeng and Zhao, Yang and Wu, Jialian and Yu, Xiaodong and Liu, Zicheng and Barsoum, Emad},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/guo2026iclr-imagedoctor/}
}