Image-Caption Encoding for Improving Zero-Shot Generalization

Abstract

Recent advances in vision-language models have combined contrastive approaches with generative methods to achieve state-of-the-art (SOTA) performance on downstream inference tasks like zero-shot image classification. However, a persistent issue of these models for image classification is their limited out-of-distribution (OOD) generalization. We first show that when an OOD data point is misclassified, the correct class can typically be found among the Top-K predicted classes. To steer the model prediction toward the correct class within these top predictions, we propose the Image-Caption Encoding (ICE) method, a straightforward approach that directly enforces consistency between the image-conditioned and caption-conditioned predictions at evaluation time only. Intuitively, we take advantage of unique properties of the generated captions to guide our local search for the correct class label within the Top-K predicted classes. We show that our method can be easily combined with other SOTA methods to enhance Top-1 OOD accuracy by 0.5% on average and up to 3% on challenging datasets.
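
To make the idea concrete, below is a minimal, hypothetical sketch of what ICE-style score fusion could look like at evaluation time. It assumes precomputed embeddings from a contrastive vision-language model (e.g., a CLIP-style image encoder and text encoder) plus a caption generated for the test image; the fusion weight `lam`, the Top-K restriction, and the function name `ice_predict` are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def ice_predict(image_emb, caption_emb, class_embs, k=10, lam=0.5):
    """Hypothetical ICE-style fusion of image- and caption-conditioned scores.

    image_emb:   (d,)   embedding of the test image
    caption_emb: (d,)   embedding of a caption generated for that image
    class_embs:  (C, d) embeddings of the class-name prompts
    """
    image_emb = F.normalize(image_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    class_embs = F.normalize(class_embs, dim=-1)

    img_scores = class_embs @ image_emb    # (C,) image-conditioned similarities
    cap_scores = class_embs @ caption_emb  # (C,) caption-conditioned similarities

    # Restrict the caption signal to the Top-K image-conditioned classes,
    # so the caption only steers the prediction within that local candidate set.
    topk = img_scores.topk(k).indices
    fused = img_scores.clone()
    fused[topk] = (1 - lam) * img_scores[topk] + lam * cap_scores[topk]
    return fused.argmax().item()

Because the fusion happens purely at evaluation time, a sketch like this composes naturally with other zero-shot methods: any technique that produces per-class scores can supply `img_scores` before the caption-conditioned reweighting step.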

Cite

Text

Yu et al. "Image-Caption Encoding for Improving Zero-Shot Generalization." Winter Conference on Applications of Computer Vision, 2025.

Markdown

[Yu et al. "Image-Caption Encoding for Improving Zero-Shot Generalization." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/yu2025wacv-imagecaption/)

BibTeX

@inproceedings{yu2025wacv-imagecaption,
  title     = {{Image-Caption Encoding for Improving Zero-Shot Generalization}},
  author    = {Yu, Eric and Liao, Christopher and Ravi, Sathvik and Tsiligkaridis, Theodoros and Kulis, Brian},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2025},
  pages     = {6977--6986},
  url       = {https://mlanthology.org/wacv/2025/yu2025wacv-imagecaption/}
}