Pre-Training Concept Frequency Is Predictive of CLIP Zero-Shot Performance

Abstract

Web-crawled pre-training datasets are speculated to be key drivers of the zero-shot generalization abilities of Vision-Language Models (VLMs) like CLIP across a range of downstream classification and retrieval tasks spanning diverse visual concepts. However, it is unclear how meaningful the term “zero-shot” generalization is for CLIP, as its pre-training datasets (e.g., YFCC-15M, LAION-2B) likely contain many samples of the “zero-shot” concepts. To study this, we present the first analysis of the composition of concepts in CLIP’s pre-training datasets. We robustly demonstrate that, far from being “zero-shot”, CLIP’s zero-shot classification performance is strongly predicted by the frequency with which a concept is seen during pre-training. Specifically, downstream zero-shot performance improves linearly as the pre-training concept frequency grows exponentially, i.e., the two follow a log-linear scaling trend. Our data-centric investigation further highlights two key findings: (1) the extreme “data hunger” of CLIP, i.e., a growing inability to make “zero-shot” predictions on long-tailed concepts, and (2) a surprising degree of misalignment across image-text pairs in the pre-training datasets.
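
The log-linear trend described above can be illustrated with a minimal sketch (not the authors' code): fitting zero-shot accuracy against the logarithm of pre-training concept frequency, i.e., accuracy ≈ a · log10(frequency) + b. The frequencies and accuracies below are made-up placeholder values, used only to show the form of the fit.

import numpy as np

# Hypothetical per-concept pre-training frequencies and zero-shot accuracies
# (illustrative values, not results from the paper).
freq = np.array([1e2, 1e3, 1e4, 1e5, 1e6])
acc = np.array([0.12, 0.27, 0.41, 0.55, 0.70])

# Least-squares fit of accuracy against log10(frequency):
# a log-linear relationship appears as a straight line on this scale.
slope, intercept = np.polyfit(np.log10(freq), acc, deg=1)
print(f"accuracy ~ {slope:.3f} * log10(frequency) + {intercept:.3f}")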

Cite

Text

Udandarao et al. "Pre-Training Concept Frequency Is Predictive of CLIP Zero-Shot Performance." ICLR 2024 Workshops: DPFM, 2024.

Markdown

[Udandarao et al. "Pre-Training Concept Frequency Is Predictive of CLIP Zero-Shot Performance." ICLR 2024 Workshops: DPFM, 2024.](https://mlanthology.org/iclrw/2024/udandarao2024iclrw-pretraining/)

BibTeX

@inproceedings{udandarao2024iclrw-pretraining,
  title     = {{Pre-Training Concept Frequency Is Predictive of CLIP Zero-Shot Performance}},
  author    = {Udandarao, Vishaal and Prabhu, Ameya and Torr, Philip and Bibi, Adel and Albanie, Samuel and Bethge, Matthias},
  booktitle = {ICLR 2024 Workshops: DPFM},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/udandarao2024iclrw-pretraining/}
}