Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation

Abstract

Traditional computer vision models are trained to predict a fixed set of predefined categories. Recently, natural language has been shown to be a broader and richer source of supervision, providing finer descriptions of visual concepts than supervised "gold" labels. Previous works, such as CLIP, use a simple pretraining task of predicting the pairings between images and text captions. CLIP, however, is data-hungry and requires more than 400M image-text pairs for training. We propose a data-efficient contrastive distillation method that uses soft labels to learn from noisy image-text pairs. Our model transfers knowledge from pre-trained image and sentence encoders and achieves strong performance with only 3M image-text pairs, 133x smaller than CLIP. Our method exceeds the previous SoTA of general zero-shot learning on ImageNet 21k+1k by a relative 73% with a ResNet50 image encoder and DeCLUTR text encoder. We also outperform CLIP by a relative 10.5% on zero-shot evaluation on Google Open Images (19,958 classes).
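The core idea, contrastive learning from image-text pairs with soft (distilled) targets instead of purely one-hot pairing labels, can be illustrated with a minimal NumPy sketch. This is not the paper's exact formulation: the function name, the temperature `tau`, and the mixing weight `alpha` are illustrative assumptions; a generic InfoNCE-style cross-entropy is used, with soft labels taken from frozen teacher (pre-trained encoder) embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_contrastive_loss(img_emb, txt_emb, teacher_img, teacher_txt,
                          tau=0.07, alpha=0.5):
    """Illustrative soft-label contrastive loss (not the paper's exact loss).

    img_emb, txt_emb         : student embeddings, shape (n, d)
    teacher_img, teacher_txt : frozen teacher embeddings, shape (n, d)
    tau, alpha               : hypothetical temperature and mixing weight
    """
    def l2norm(m):
        return m / np.linalg.norm(m, axis=1, keepdims=True)

    img, txt = l2norm(img_emb), l2norm(txt_emb)
    t_img, t_txt = l2norm(teacher_img), l2norm(teacher_txt)

    n = img.shape[0]
    # Student image-to-text similarity logits.
    logits = img @ txt.T / tau
    # Hard targets: the i-th image matches the i-th caption.
    hard = np.eye(n)
    # Soft targets: the teacher's similarity distribution, which down-weights
    # noisy pairs and spreads mass over semantically similar captions.
    soft = softmax(t_img @ t_txt.T / tau, axis=1)
    targets = alpha * hard + (1.0 - alpha) * soft

    # Cross-entropy between the target distribution and student predictions.
    log_p = np.log(softmax(logits, axis=1) + 1e-12)
    return -(targets * log_p).sum(axis=1).mean()
```

In practice a symmetric text-to-image term would be added and the teacher logits computed from the pre-trained image and sentence encoders the paper distills from; the sketch above shows only the one-directional image-to-text case.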

Cite

Text

Cheng et al. "Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2021. doi:10.1109/CVPRW53098.2021.00348

Markdown

[Cheng et al. "Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2021.](https://mlanthology.org/cvprw/2021/cheng2021cvprw-dataefficient/) doi:10.1109/CVPRW53098.2021.00348

BibTeX

@inproceedings{cheng2021cvprw-dataefficient,
  title     = {{Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation}},
  author    = {Cheng, Ruizhe and Wu, Bichen and Zhang, Peizhao and Vajda, Peter and Gonzalez, Joseph E.},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2021},
  pages     = {3119--3124},
  doi       = {10.1109/CVPRW53098.2021.00348},
  url       = {https://mlanthology.org/cvprw/2021/cheng2021cvprw-dataefficient/}
}