CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision

Abstract

We propose CLIP-Lite, an information-efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample pair during the optimization of its contrastive learning objective. We accomplish this by taking advantage of an information-efficient lower bound to maximize the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly less data and smaller batch sizes while obtaining better performance than CLIP at the same scale. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains a +14.0% mAP absolute gain in performance on Pascal VOC classification, and a +22.1% top-1 accuracy gain on ImageNet, while being comparable or superior to other, more complex, text-supervised models. CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot classification, and visual grounding. Finally, we show that CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks. Implementation: https://github.com/4m4n5/CLIP-Lite
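
The abstract does not state the exact form of the lower bound, so the following is only a minimal sketch of the one-negative-per-positive idea. It assumes a Jensen-Shannon-style mutual information bound of the kind used in Deep InfoMax, a plain dot product as the critic, and a cyclic shift of the batch to draw the single negative caption for each image. The names `softplus` and `jsd_mi_lower_bound` are hypothetical and are not taken from the linked implementation.

```python
import numpy as np


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)


def jsd_mi_lower_bound(img, txt):
    """Jensen-Shannon-style MI bound with one negative per positive.

    img, txt: (batch, dim) arrays where row i of each is a matched pair.
    The dot-product critic and the cyclic-shift negatives are assumptions
    made for illustration only.
    """
    # Positive scores: each image against its own caption.
    pos = np.sum(img * txt, axis=1)
    # Negative scores: each image against the caption of the previous
    # sample in the batch (cyclic shift), i.e. exactly one negative each.
    neg = np.sum(img * np.roll(txt, 1, axis=0), axis=1)
    # E_pos[-softplus(-T)] - E_neg[softplus(T)]; maximize during training.
    return np.mean(-softplus(-pos)) - np.mean(softplus(neg))


if __name__ == "__main__":
    emb = 5.0 * np.eye(4)                   # toy, perfectly aligned embeddings
    shuffled = np.roll(emb, 1, axis=0)      # captions paired with wrong images
    print(jsd_mi_lower_bound(emb, emb))
    print(jsd_mi_lower_bound(emb, shuffled))
```

In this toy setting the aligned pairing scores higher than the shuffled one, which is the behavior a training loop would exploit by minimizing the negative of the bound. Unlike a softmax contrastive loss over all pairs in a batch, the cost here grows linearly with batch size because only two scores are computed per sample.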

Cite

Text

Shrivastava et al. "CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision." Artificial Intelligence and Statistics, 2023.

Markdown

[Shrivastava et al. "CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision." Artificial Intelligence and Statistics, 2023.](https://mlanthology.org/aistats/2023/shrivastava2023aistats-cliplite/)

BibTeX

@inproceedings{shrivastava2023aistats-cliplite,
  title     = {{CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision}},
  author    = {Shrivastava, Aman and Selvaraju, Ramprasaath R. and Naik, Nikhil and Ordonez, Vicente},
  booktitle = {Artificial Intelligence and Statistics},
  year      = {2023},
  pages     = {8433--8447},
  volume    = {206},
  url       = {https://mlanthology.org/aistats/2023/shrivastava2023aistats-cliplite/}
}