Leveraging Multimodal Large Language Models for Joint Discrete and Continuous Evaluation in Text-to-Image Alignment

Abstract

Text-to-image (T2I) generation has seen rapid advances with the development of powerful diffusion-based and transformer-based models. These models enable the creation of both artistic illustrations and highly photorealistic images, making it increasingly important to accurately evaluate how well generated images align with their corresponding text prompts. In this paper, we propose a novel method for evaluating image-text alignment that leverages advanced multimodal large language models (MLLMs). First, we develop a specialized prompt engineering strategy that targets fine-grained elements, such as actions, spatial relationships, quantities, and orientations, guiding the model to capture subtle details in both the textual and visual modalities. Second, we perform supervised fine-tuning with a dual-loss strategy that minimizes discrepancies between predicted continuous scores and the ground truth, thereby providing a more precise measure of alignment. Third, we propose a regression retraining approach that extracts intermediate features from the MLLM's decoder and employs a multilayer perceptron to predict alignment scores. The experimental results demonstrate that the proposed methods significantly improve both overall and fine-grained alignment evaluation, offering a robust solution for T2I alignment assessment.
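The paper's exact architecture and hyperparameters are not given in the abstract; as an illustrative sketch only, the regression-retraining idea (an MLP head over pooled decoder features, trained with a joint continuous/discrete loss) might look like the following pure-Python toy. All dimensions, the two-head layout, the five-level score discretization, and the loss weighting `alpha` are assumptions, not the authors' actual design.

```python
import math
import random

random.seed(0)

# Assumed toy sizes: a real MLLM decoder feature would be far larger.
FEAT_DIM, HIDDEN_DIM, N_BINS = 8, 4, 5

def linear(x, weights, biases):
    """One output per weight row: dot(x, row) + bias."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + b
            for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

# Randomly initialised stand-ins for trained MLP parameters.
w1 = [[random.gauss(0, 0.1) for _ in range(FEAT_DIM)] for _ in range(HIDDEN_DIM)]
b1 = [0.0] * HIDDEN_DIM
w_reg = [[random.gauss(0, 0.1) for _ in range(HIDDEN_DIM)]]   # continuous head
b_reg = [0.0]
w_cls = [[random.gauss(0, 0.1) for _ in range(HIDDEN_DIM)]    # discrete head
         for _ in range(N_BINS)]
b_cls = [0.0] * N_BINS

def predict(features):
    """MLP over pooled decoder features -> (continuous score, bin distribution)."""
    h = relu(linear(features, w1, b1))
    score = sigmoid(linear(h, w_reg, b_reg)[0])    # continuous score in (0, 1)
    bin_probs = softmax(linear(h, w_cls, b_cls))   # discrete 5-level rating
    return score, bin_probs

def dual_loss(score, bin_probs, target_score, alpha=0.5):
    """Joint objective: MSE on the continuous score + CE on the discretized bin."""
    mse = (score - target_score) ** 2
    target_bin = min(int(target_score * N_BINS), N_BINS - 1)
    ce = -math.log(bin_probs[target_bin] + 1e-12)
    return alpha * mse + (1 - alpha) * ce

# Mock pooled decoder features for one image-text pair.
features = [random.gauss(0, 1) for _ in range(FEAT_DIM)]
score, probs = predict(features)
loss = dual_loss(score, probs, target_score=0.8)
```

In a real pipeline the forward pass and loss would be implemented in an autodiff framework so the MLP head can be trained by gradient descent; the toy above only shows the data flow from decoder features to a joint discrete/continuous objective.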

Cite

Text

Zhang et al. "Leveraging Multimodal Large Language Models for Joint Discrete and Continuous Evaluation in Text-to-Image Alignment." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.

Markdown

[Zhang et al. "Leveraging Multimodal Large Language Models for Joint Discrete and Continuous Evaluation in Text-to-Image Alignment." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/zhang2025cvprw-leveraging/)

BibTeX

@inproceedings{zhang2025cvprw-leveraging,
  title     = {{Leveraging Multimodal Large Language Models for Joint Discrete and Continuous Evaluation in Text-to-Image Alignment}},
  author    = {Zhang, Zhichao and Li, Xinyue and Sun, Wei and Zhang, Zicheng and Li, Yunhao and Liu, Xiaohong and Zhai, Guangtao},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2025},
  pages     = {977--986},
  url       = {https://mlanthology.org/cvprw/2025/zhang2025cvprw-leveraging/}
}