Enhancing Reward Models for High-Quality Image Generation: Beyond Text-Image Alignment
Abstract
Contemporary image generation systems have achieved high fidelity and superior aesthetic quality beyond basic text-image alignment. However, existing evaluation frameworks have failed to evolve in parallel. This study reveals that human preference reward models fine-tuned from CLIP and BLIP architectures have inherent flaws: they inappropriately assign low scores to images with rich details and high aesthetic value, creating a significant discrepancy with actual human aesthetic preferences. To address this issue, we design a novel evaluation score, the ICT (Image-Contained-Text) score, that achieves and surpasses the objectives of text-image alignment by assessing the degree to which images represent textual content. Building upon this foundation, we further train an HP (High-Preference) score model using solely the image modality, aiming to enhance image quality in aspects such as aesthetics and detail refinement while maintaining the achieved text-image alignment. Experiments demonstrate that the proposed evaluation model improves scoring accuracy by over 10% compared to existing methods, and achieves significant results in optimizing state-of-the-art text-to-image models. This research provides a theoretical foundation and empirical support for the evolution of image generation technology toward better alignment with higher-order human aesthetic preferences.
Cite
Text
Ba et al. "Enhancing Reward Models for High-Quality Image Generation: Beyond Text-Image Alignment." International Conference on Computer Vision, 2025.
Markdown
[Ba et al. "Enhancing Reward Models for High-Quality Image Generation: Beyond Text-Image Alignment." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/ba2025iccv-enhancing/)
BibTeX
@inproceedings{ba2025iccv-enhancing,
title = {{Enhancing Reward Models for High-Quality Image Generation: Beyond Text-Image Alignment}},
author = {Ba, Ying and Zhang, Tianyu and Bai, Yalong and Mo, Wenyi and Liang, Tao and Su, Bing and Wen, Ji-Rong},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {19022-19031},
url = {https://mlanthology.org/iccv/2025/ba2025iccv-enhancing/}
}