Classifier-Free Guidance Makes Image Captioning Models More Descriptive
Abstract
Image captioning is conventionally formulated as the task of generating captions that are similar to a set of human-written reference captions, as measured by evaluation metrics such as CIDEr, ROUGE, and BLEU. Recent work has also explored reference-free captioning metrics based on the distance between generated captions and the corresponding images in the embedding space of a contrastively trained image-text model such as CLIP. Here, we show that it is possible to trade off between reference-free and reference-based captioning metrics by decoding from a single autoregressive captioning model using classifier-free guidance (Ho & Salimans, 2021). Compared to standard greedy decoding, decoding from the same model with a guidance scale of 3 substantially improves caption→image retrieval performance when captions and images are embedded using CLIP (recall@1 49.4% vs. 26.5%) and CLIPScore (0.808 vs. 0.775), but greatly worsens standard reference-based captioning metrics (e.g., CIDEr 41.7 vs. 126.1). Manual inspection reveals that higher guidance scales produce more descriptive but less grammatical captions.
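At decoding time, classifier-free guidance combines the next-token logits of an image-conditioned model with those of an unconditional (caption-only) model, extrapolating away from the unconditional distribution. A minimal sketch of this combination rule, using toy logit vectors (the function and variable names here are illustrative, not from the paper's code):

```python
def cfg_logits(cond_logits, uncond_logits, guidance_scale):
    """Classifier-free guidance over next-token logits.

    guidance_scale = 1 recovers standard conditional decoding; larger
    values push probability mass toward tokens that are more informative
    about the image (more "descriptive"), at the cost of fluency.
    """
    return [u + guidance_scale * (c - u)
            for c, u in zip(cond_logits, uncond_logits)]

def argmax(xs):
    """Index of the largest element (greedy token choice)."""
    return max(range(len(xs)), key=xs.__getitem__)

# Toy 3-token vocabulary: token 0 is generic, token 1 is image-specific.
cond = [2.0, 1.5, 0.1]    # logits from the image-conditioned model
uncond = [2.5, 0.5, 0.1]  # logits from the unconditional model

greedy_token = argmax(cond)                          # standard greedy decoding
guided_token = argmax(cfg_logits(cond, uncond, 3.0)) # guidance scale 3
```

With a guidance scale of 3, the guided logits become [1.0, 3.5, 0.1], so greedy decoding under guidance picks the image-specific token 1 where standard conditional decoding would pick the generic token 0, illustrating how guidance trades fluency-favored tokens for image-discriminative ones.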
Cite
Text
Kornblith et al. "Classifier-Free Guidance Makes Image Captioning Models More Descriptive." ICLR 2023 Workshops: MRL, 2023.
Markdown
[Kornblith et al. "Classifier-Free Guidance Makes Image Captioning Models More Descriptive." ICLR 2023 Workshops: MRL, 2023.](https://mlanthology.org/iclrw/2023/kornblith2023iclrw-classifierfree/)
BibTeX
@inproceedings{kornblith2023iclrw-classifierfree,
title = {{Classifier-Free Guidance Makes Image Captioning Models More Descriptive}},
author = {Kornblith, Simon and Li, Lala and Wang, Zirui and Nguyen, Thao},
booktitle = {ICLR 2023 Workshops: MRL},
year = {2023},
url = {https://mlanthology.org/iclrw/2023/kornblith2023iclrw-classifierfree/}
}