How Much Can CLIP Benefit Vision-and-Language Tasks?

Abstract

Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders that use a relatively small set of manually annotated data (as compared to web-crawled data) to perceive the visual world. However, it has been observed that large-scale pre-training usually results in better generalization performance; for example, CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, has shown strong zero-shot capability on various vision tasks. To further study the advantage brought by CLIP, we propose to use CLIP as the visual encoder in various V&L models in two typical scenarios: 1) plugging CLIP into task-specific fine-tuning; 2) combining CLIP with V&L pre-training and transferring to downstream tasks. We show that CLIP significantly outperforms widely used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown. We achieve competitive or better results on diverse V&L tasks while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.
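The core idea is to swap the usual region-based visual encoder for CLIP's image encoder and feed its features to the V&L model. As a rough illustration only, the sketch below shows how CLIP image features can be extracted; it assumes the Hugging Face transformers CLIP implementation and the openai/clip-vit-base-patch32 checkpoint, neither of which the paper prescribes.

```python
# Hypothetical sketch: extracting CLIP visual features that could stand in for a
# region-based encoder (e.g., BottomUp-TopDown) in a V&L model.
# Assumes Hugging Face `transformers` and the openai/clip-vit-base-patch32
# checkpoint; the paper does not specify this particular API or model size.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any RGB image (path is illustrative)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    # Pooled, projected image embedding: one vector per image.
    pooled = model.get_image_features(**inputs)             # shape: (1, 512)
    # Patch-level grid features from the vision transformer, which could serve
    # as the visual input sequence of a downstream V&L model.
    grid = model.vision_model(**inputs).last_hidden_state   # shape: (1, 50, 768)
```

In a fine-tuning setup, such features would replace the object-detector features normally consumed by the V&L model, with CLIP either frozen or updated jointly with the rest of the network.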

Cite

Text

Shen et al. "How Much Can CLIP Benefit Vision-and-Language Tasks?" International Conference on Learning Representations, 2022.

Markdown

[Shen et al. "How Much Can CLIP Benefit Vision-and-Language Tasks?" International Conference on Learning Representations, 2022.](https://mlanthology.org/iclr/2022/shen2022iclr-much/)

BibTeX

@inproceedings{shen2022iclr-much,
  title     = {{How Much Can CLIP Benefit Vision-and-Language Tasks?}},
  author    = {Shen, Sheng and Li, Liunian Harold and Tan, Hao and Bansal, Mohit and Rohrbach, Anna and Chang, Kai-Wei and Yao, Zhewei and Keutzer, Kurt},
  booktitle = {International Conference on Learning Representations},
  year      = {2022},
  url       = {https://mlanthology.org/iclr/2022/shen2022iclr-much/}
}