ContextRef: Evaluating Referenceless Metrics for Image Description Generation
Abstract
Referenceless metrics (e.g., CLIPScore) use pretrained vision-language models to assess image descriptions directly, without costly ground-truth reference texts. Such methods can facilitate rapid progress, but only if they truly align with human preference judgments. In this paper, we introduce ContextRef, a benchmark for assessing referenceless metrics for such alignment. ContextRef has two components: human ratings along a variety of established quality dimensions, and ten diverse robustness checks designed to uncover fundamental weaknesses. A crucial aspect of ContextRef is that images and descriptions are presented in context, reflecting prior work showing that context is important for description quality. Using ContextRef, we assess a variety of pretrained models, scoring functions, and techniques for incorporating context. None of the methods is successful with ContextRef, but we show that careful fine-tuning yields substantial improvements. ContextRef remains a challenging benchmark, though, in large part due to the challenge of context dependence.
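To make the referenceless setup concrete, the sketch below computes a CLIPScore-style score: a description is rated by its cosine similarity to the image in a pretrained vision-language embedding space, with no reference texts involved. This is a minimal illustration, not the paper's evaluation code; the Hugging Face `transformers` CLIP checkpoint and the 2.5 rescaling weight are assumptions following the original CLIPScore formulation.

```python
# Minimal CLIPScore-style referenceless metric (sketch; model choice is an assumption).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clipscore(image: Image.Image, description: str) -> float:
    """Score a description against an image without any ground-truth references."""
    inputs = processor(text=[description], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Normalize the projected embeddings and take their cosine similarity.
    img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    cos = (img_emb * txt_emb).sum(dim=-1).item()
    return 2.5 * max(cos, 0.0)  # rescaling used by the original CLIPScore

# Usage: clipscore(Image.open("photo.jpg"), "A dog catching a frisbee in a park.")
```

Note that a score computed this way sees only the image and the description; it has no access to the surrounding context (e.g., the webpage or article the image appears in), which is precisely the gap ContextRef is designed to probe.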
Cite
Text
Kreiss et al. "ContextRef: Evaluating Referenceless Metrics for Image Description Generation." International Conference on Learning Representations, 2024.
Markdown
[Kreiss et al. "ContextRef: Evaluating Referenceless Metrics for Image Description Generation." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/kreiss2024iclr-contextref/)
BibTeX
@inproceedings{kreiss2024iclr-contextref,
  title = {{ContextRef: Evaluating Referenceless Metrics for Image Description Generation}},
  author = {Kreiss, Elisa and Zelikman, Eric and Potts, Christopher and Haber, Nick},
  booktitle = {International Conference on Learning Representations},
  year = {2024},
  url = {https://mlanthology.org/iclr/2024/kreiss2024iclr-contextref/}
}