Evaluating and Improving Compositional Text-to-Visual Generation
Abstract
While text-to-visual models now produce photo-realistic images and videos, they struggle with compositional text prompts involving attributes, relationships, and higher-order reasoning such as logic and comparison. In this work, we conduct an extensive human study on GenAI-Bench to evaluate the performance of leading image and video generation models in various aspects of compositional text-to-visual generation. We also compare automated evaluation metrics against our collected human ratings and find that VQAScore – a metric measuring the likelihood that a VQA model views an image as accurately depicting the prompt – significantly outperforms previous metrics such as CLIPScore. In addition, VQAScore can improve generation in a black-box manner (without finetuning) by simply ranking a few (3 to 9) candidate images. Ranking by VQAScore is 2x to 3x more effective than other scoring methods such as PickScore and ImageReward at improving human ratings for DALL-E 3 and Stable Diffusion, especially on compositional prompts that require advanced visio-linguistic reasoning. Lastly, we identify areas for improvement in VQAScore, such as handling fine-grained visual details. Despite these limitations, VQAScore serves as both the best automated metric and the best reward function for improving prompt alignment. We will release over 80,000 human ratings to facilitate scientific benchmarking of both generative models and automated metrics.
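To make the abstract's black-box ranking result concrete: VQAScore scores an image-prompt pair by the probability that a VQA model answers "Yes" when asked whether the image shows the prompt, and generation is improved by sampling a few candidates and keeping the highest-scoring one. The sketch below illustrates that best-of-N loop. It assumes the authors' publicly released t2v_metrics package with a VQAScore entry point and a clip-flant5-xxl checkpoint; treat the exact names and call signature as assumptions that may differ from the released API.

# Minimal sketch of black-box best-of-N ranking with VQAScore.
# Assumes the authors' t2v_metrics package (pip install t2v-metrics);
# model name and call signature are assumptions, not a verified API.
import t2v_metrics

# VQAScore ~ P(VQA model answers "Yes" to "Does this figure show '{prompt}'?")
score_fn = t2v_metrics.VQAScore(model="clip-flant5-xxl")

def best_of_n(image_paths: list[str], prompt: str) -> str:
    """Return the candidate image VQAScore ranks as best matching the prompt."""
    # Assumed to return a (num_images x num_texts) tensor of scores.
    scores = score_fn(images=image_paths, texts=[prompt])
    best = scores.squeeze(1).argmax().item()
    return image_paths[best]

# Usage: generate 3-9 candidates with any black-box generator, then rank.
# candidates = ["gen_0.png", "gen_1.png", "gen_2.png"]
# best_image = best_of_n(candidates, "three dogs but no cats on a sofa")

Because the generator is never finetuned, the same loop applies to closed models such as DALL-E 3: only the number of sampled candidates and the scoring function change.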
Cite
Text
Li et al. "Evaluating and Improving Compositional Text-to-Visual Generation." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00538
Markdown
[Li et al. "Evaluating and Improving Compositional Text-to-Visual Generation." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/li2024cvprw-evaluating/) doi:10.1109/CVPRW63382.2024.00538
BibTeX
@inproceedings{li2024cvprw-evaluating,
title = {{Evaluating and Improving Compositional Text-to-Visual Generation}},
author = {Li, Baiqi and Lin, Zhiqiu and Pathak, Deepak and Li, Jiayao and Fei, Yixin and Wu, Kewen and Xia, Xide and Zhang, Pengchuan and Neubig, Graham and Ramanan, Deva},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2024},
pages = {5290--5301},
doi = {10.1109/CVPRW63382.2024.00538},
url = {https://mlanthology.org/cvprw/2024/li2024cvprw-evaluating/}
}