Transferring Pre-Trained Multimodal Representations with Cross-Modal Similarity Matching

Abstract

Despite surprising performance on zero-shot transfer, pre-training a large-scale multimodal model is often prohibitive as it requires a huge amount of data and computing resources. In this paper, we propose a method (BeamCLIP) that can effectively transfer the representations of a large pre-trained multimodal model (CLIP-ViT) into a small target model (e.g., ResNet-18). For unsupervised transfer, we introduce cross-modal similarity matching (CSM) that enables a student model to learn the representations of a teacher model by matching the relative similarity distribution across text prompt embeddings. To better encode the text prompts, we design context-based prompt augmentation (CPA) that can alleviate the lexical ambiguity of input text prompts. Our experiments show that unsupervised representation transfer of a pre-trained vision-language model enables a small ResNet-18 to achieve a better ImageNet-1K top-1 linear probe accuracy (66.2%) than vision-only self-supervised learning (SSL) methods (e.g., SimCLR: 51.8%, SwAV: 63.7%), while closing the gap with supervised learning (69.8%).
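
As a reading aid, here is a minimal sketch of what the cross-modal similarity matching (CSM) objective described above could look like, assuming a PyTorch setup in which the student projects its image features into the teacher's (CLIP) embedding space and the loss is a KL divergence between softmax-normalized cosine similarities over a fixed bank of text prompt embeddings. The function and parameter names (csm_loss, temperature, the 0.04 default) are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn.functional as F

def csm_loss(student_img_emb, teacher_img_emb, text_prompt_emb, temperature=0.04):
    """Sketch of a cross-modal similarity matching (CSM) loss.

    The student is trained so that the distribution of its image embedding's
    similarities over a fixed bank of text prompt embeddings matches the
    teacher's distribution. Shapes: image embeddings (B, D), prompts (K, D).
    """
    # L2-normalize so dot products are cosine similarities (CLIP convention).
    s = F.normalize(student_img_emb, dim=-1)
    t = F.normalize(teacher_img_emb, dim=-1)
    p = F.normalize(text_prompt_emb, dim=-1)

    # Relative similarity distributions over the text prompt embeddings.
    student_logits = s @ p.t() / temperature          # (B, K)
    teacher_logits = t @ p.t() / temperature          # (B, K)

    teacher_dist = F.softmax(teacher_logits, dim=-1)  # target distribution
    student_logp = F.log_softmax(student_logits, dim=-1)

    # KL(teacher || student), averaged over the batch; teacher side is not trained.
    return F.kl_div(student_logp, teacher_dist.detach(), reduction="batchmean")

In such a setup, the student image embedding would come from a small backbone (e.g., ResNet-18) followed by a projection head matching the teacher's embedding dimension, and the prompt bank from CLIP's text encoder applied to class-name prompts, optionally augmented with context as in CPA; these wiring details are assumptions based on the abstract, not taken from the paper itself.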

Cite

Text

Kim et al. "Transferring Pre-Trained Multimodal Representations with Cross-Modal Similarity Matching." Neural Information Processing Systems, 2022.

Markdown

[Kim et al. "Transferring Pre-Trained Multimodal Representations with Cross-Modal Similarity Matching." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/kim2022neurips-transferring/)

BibTeX

@inproceedings{kim2022neurips-transferring,
  title     = {{Transferring Pre-Trained Multimodal Representations with Cross-Modal Similarity Matching}},
  author    = {Kim, Byoungjip and Choi, Sungik and Hwang, Dasol and Lee, Moontae and Lee, Honglak},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/kim2022neurips-transferring/}
}