CLIP-KD: An Empirical Study of CLIP Model Distillation

Abstract

Contrastive Language-Image Pre-training (CLIP) has become a promising language-supervised visual pre-training framework. This paper aims to distill small CLIP models supervised by a large teacher CLIP model. We propose several distillation strategies, including relation, feature, gradient, and contrastive paradigms, to examine the effectiveness of CLIP-Knowledge Distillation (KD). We show that simple feature mimicry with a Mean Squared Error loss works surprisingly well. Moreover, interactive contrastive learning across teacher and student encoders is also effective for performance improvement. We explain that the success of CLIP-KD can be attributed to maximizing the feature similarity between teacher and student. The unified method is applied to distill several student models trained on CC3M+12M. CLIP-KD improves student CLIP models consistently over zero-shot ImageNet classification and cross-modal retrieval benchmarks. When using ViT-L/14 pretrained on Laion-400M as the teacher, CLIP-KD achieves 57.5% and 55.4% zero-shot top-1 ImageNet accuracy with ViT-B/16 and ResNet-50, surpassing the original CLIP without KD by margins of 20.5% and 20.1%, respectively. Our code is released at https://github.com/winycg/CLIP-KD.
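The two ingredients the abstract highlights, feature mimicry with an MSE loss and interactive contrastive learning across teacher and student encoders, can be sketched as loss functions. This is a minimal illustration, not the paper's implementation: the function names, the linear projection, and the temperature value are assumptions introduced here for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def feature_mimicry_loss(student_feat: torch.Tensor,
                         teacher_feat: torch.Tensor,
                         proj: nn.Module) -> torch.Tensor:
    # Feature mimicry: project student features to the teacher's
    # embedding size, then match them with a plain MSE loss.
    # `proj` (e.g. a Linear layer) is a hypothetical adapter added
    # here because student and teacher widths typically differ.
    return F.mse_loss(proj(student_feat), teacher_feat)


def interactive_contrastive_loss(student_img: torch.Tensor,
                                 teacher_txt: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    # Interactive contrastive term: contrast student image embeddings
    # against teacher text embeddings (the symmetric student-text vs.
    # teacher-image term would be formed the same way). Matched
    # image-text pairs sit on the diagonal of the logit matrix.
    student_img = F.normalize(student_img, dim=-1)
    teacher_txt = F.normalize(teacher_txt, dim=-1)
    logits = student_img @ teacher_txt.t() / temperature
    targets = torch.arange(student_img.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```

In a training loop these terms would simply be added to the student's usual CLIP contrastive loss, with the teacher's encoders kept frozen.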

Cite

Text

Yang et al. "CLIP-KD: An Empirical Study of CLIP Model Distillation." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01510

Markdown

[Yang et al. "CLIP-KD: An Empirical Study of CLIP Model Distillation." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/yang2024cvpr-clipkd/) doi:10.1109/CVPR52733.2024.01510

BibTeX

@inproceedings{yang2024cvpr-clipkd,
  title     = {{CLIP-KD: An Empirical Study of CLIP Model Distillation}},
  author    = {Yang, Chuanguang and An, Zhulin and Huang, Libo and Bi, Junyu and Yu, Xinqiang and Yang, Han and Diao, Boyu and Xu, Yongjun},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {15952--15962},
  doi       = {10.1109/CVPR52733.2024.01510},
  url       = {https://mlanthology.org/cvpr/2024/yang2024cvpr-clipkd/}
}