The Unreasonable Effectiveness of CLIP Features for Image Captioning: An Experimental Analysis

Abstract

Generating textual descriptions from visual inputs is a fundamental step towards machine intelligence, as it entails modeling the connections between the visual and textual modalities. For years, image captioning models have relied on pre-trained visual encoders and object detectors, trained on relatively small sets of data. Recently, it has been observed that large-scale multi-modal approaches like CLIP (Contrastive Language-Image Pre-training), trained on a massive number of image-caption pairs, provide strong zero-shot capability on various vision tasks. In this paper, we study the advantage brought by CLIP in image captioning, employing it as a visual encoder. Through extensive experiments, we show how CLIP can significantly outperform widely used visual encoders and quantify its role under different architectures, variants, and evaluation protocols, ranging from classical captioning performance to zero-shot transfer.

Cite

Text

Barraco et al. "The Unreasonable Effectiveness of CLIP Features for Image Captioning: An Experimental Analysis." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022. doi:10.1109/CVPRW56347.2022.00512

Markdown

[Barraco et al. "The Unreasonable Effectiveness of CLIP Features for Image Captioning: An Experimental Analysis." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022.](https://mlanthology.org/cvprw/2022/barraco2022cvprw-unreasonable/) doi:10.1109/CVPRW56347.2022.00512

BibTeX

@inproceedings{barraco2022cvprw-unreasonable,
  title     = {{The Unreasonable Effectiveness of CLIP Features for Image Captioning: An Experimental Analysis}},
  author    = {Barraco, Manuele and Cornia, Marcella and Cascianelli, Silvia and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2022},
  pages     = {4661--4669},
  doi       = {10.1109/CVPRW56347.2022.00512},
  url       = {https://mlanthology.org/cvprw/2022/barraco2022cvprw-unreasonable/}
}