Simple Token-Level Confidence Improves Caption Correctness
Abstract
The ability to judge whether a caption correctly describes an image is a critical part of vision-language understanding. However, state-of-the-art models often misinterpret the correctness of fine-grained details, leading to errors in outputs such as hallucinating objects in generated captions or poor compositional reasoning. In this work, we explore Token-Level Confidence, or TLC, as a simple yet surprisingly effective method to assess caption correctness. Specifically, we fine-tune a vision-language model on image captioning, input an image and proposed caption to the model, and aggregate either algebraic or learned token confidences over words or sequences to estimate image-caption consistency. Compared to sequence-level scores from pretrained models, TLC with algebraic confidence more than doubles image and group scores for compositional reasoning on Winoground. When training data are available, a learned confidence estimator provides further improved performance, reducing object hallucination rates in MS COCO Captions by a relative 30% over the original model and setting a new state-of-the-art.
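The algebraic variant described above can be illustrated with a minimal sketch. It assumes the captioning model exposes one logit vector per caption token, takes the softmax probability of each observed token as its confidence, and aggregates those values over word spans or the full sequence. The function names and the aggregation choices (`min`, `mean`, `prod`) are hypothetical and chosen for illustration; the abstract does not specify which aggregations the authors use.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_confidences(
    step_logits: Sequence[Sequence[float]], token_ids: Sequence[int]
) -> List[float]:
    """Probability the model assigns to each token of the proposed caption.

    step_logits[i] is the (assumed) logit vector at caption position i,
    token_ids[i] is the index of the caption token at that position.
    """
    return [softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids)]


def aggregate(confs: Sequence[float], how: str = "min") -> float:
    """Combine token confidences into one score (illustrative choices only)."""
    if not confs:
        raise ValueError("no confidences to aggregate")
    if how == "min":
        return min(confs)
    if how == "mean":
        return sum(confs) / len(confs)
    if how == "prod":
        return math.prod(confs)
    raise ValueError(f"unknown aggregation: {how}")


def word_confidences(
    confs: Sequence[float], word_spans: Sequence[Tuple[int, int]], how: str = "min"
) -> List[float]:
    """Aggregate over each word, given half-open (start, end) token spans."""
    return [aggregate(confs[start:end], how) for start, end in word_spans]
```

A sequence-level score is `aggregate(token_confidences(step_logits, token_ids))`, and a word whose score falls below a chosen threshold could be flagged as a possible hallucination. The learned estimator mentioned in the abstract would replace the softmax probability with the output of a trained predictor, which this sketch does not cover.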
Cite
Text
Petryk et al. "Simple Token-Level Confidence Improves Caption Correctness." Winter Conference on Applications of Computer Vision, 2024.
Markdown
[Petryk et al. "Simple Token-Level Confidence Improves Caption Correctness." Winter Conference on Applications of Computer Vision, 2024.](https://mlanthology.org/wacv/2024/petryk2024wacv-simple/)
BibTeX
@inproceedings{petryk2024wacv-simple,
title = {{Simple Token-Level Confidence Improves Caption Correctness}},
author = {Petryk, Suzanne and Whitehead, Spencer and Gonzalez, Joseph E. and Darrell, Trevor and Rohrbach, Anna and Rohrbach, Marcus},
booktitle = {Winter Conference on Applications of Computer Vision},
year = {2024},
pages = {5742--5752},
url = {https://mlanthology.org/wacv/2024/petryk2024wacv-simple/}
}