Leveraging Panoptic Scene Graph for Evaluating Fine-Grained Text-to-Image Generation

Abstract

Text-to-image (T2I) models have advanced rapidly with diffusion-based breakthroughs, yet their evaluation remains challenging. Human assessments are costly, and existing automated metrics lack accurate compositional understanding. To address these limitations, we introduce PSG-Bench, a novel benchmark featuring 5K text prompts designed to evaluate the capabilities of advanced T2I models. Additionally, we propose PSGEval, a scene graph-based evaluation metric that converts generated images into structured representations and applies graph matching techniques for accurate and scalable assessment. PSGEval is a detection-based metric that does not rely on question-answer (QA) generation. Our experimental results demonstrate that PSGEval aligns well with human evaluations, mitigating biases present in existing automated metrics. We further provide a detailed ranking and analysis of recent T2I models, offering a robust framework for future research in T2I evaluation.
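
The general idea of scoring a generated image by matching scene graphs can be illustrated with a minimal sketch. This is not the PSGEval implementation: the abstract does not specify the matching procedure, so the `Triple` type, the `triple_recall` function, the exact-match rule, and the example graphs below are all hypothetical, chosen only to show what comparing a prompt graph against a detected graph might look like.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    """One scene-graph edge: (subject, predicate, object). Hypothetical type."""
    subject: str
    predicate: str
    obj: str


def triple_recall(prompt_graph: Set[Triple], image_graph: Set[Triple]) -> float:
    """Fraction of prompt triples that also appear in the image graph.

    Assumption: exact string matching on all three fields. A real metric
    would likely need softer matching (synonyms, instance assignment).
    """
    if not prompt_graph:
        # An empty prompt graph imposes no constraints on the image.
        return 1.0
    matched = prompt_graph & image_graph
    return len(matched) / len(prompt_graph)


# Hypothetical graph parsed from a text prompt.
prompt_graph = {
    Triple("dog", "sitting on", "sofa"),
    Triple("cat", "beside", "dog"),
}

# Hypothetical graph produced by a scene graph detector on the generated image.
image_graph = {
    Triple("dog", "sitting on", "sofa"),
    Triple("lamp", "on", "table"),
}

score = triple_recall(prompt_graph, image_graph)
print(f"triple recall: {score:.2f}")
```

In this toy case, one of the two prompt relations is found in the image graph, so the score is 0.5. Exact set intersection is the simplest possible matching rule and is used here only for clarity.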

Cite

Text

Deng et al. "Leveraging Panoptic Scene Graph for Evaluating Fine-Grained Text-to-Image Generation." International Conference on Computer Vision, 2025.

Markdown

[Deng et al. "Leveraging Panoptic Scene Graph for Evaluating Fine-Grained Text-to-Image Generation." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/deng2025iccv-leveraging/)

BibTeX

@inproceedings{deng2025iccv-leveraging,
  title     = {{Leveraging Panoptic Scene Graph for Evaluating Fine-Grained Text-to-Image Generation}},
  author    = {Deng, Xueqing and Yang, Linjie and Yu, Qihang and Yang, Chenglin and Chen, Liang-Chieh},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {15107-15116},
  url       = {https://mlanthology.org/iccv/2025/deng2025iccv-leveraging/}
}