SLIP: Self-Supervision Meets Language-Image Pre-Training

Abstract

Recent work has shown that self-supervised pre-training leads to improvements over supervised learning on challenging visual recognition tasks. CLIP, an exciting new approach to learning with language supervision, demonstrates promising performance on a wide variety of benchmarks. In this work, we explore whether self-supervised learning can aid in the use of language supervision for visual representation learning with Vision Transformers. We introduce SLIP, a multi-task learning framework for combining self-supervised learning and CLIP pre-training. After pre-training, we thoroughly evaluate representation quality and compare performance to both CLIP and self-supervised learning under three distinct settings: zero-shot transfer, linear classification, and end-to-end finetuning. Across ImageNet and a battery of additional datasets, we find that SLIP improves accuracy by a large margin. We validate our results further with experiments on different model sizes, training schedules, and pre-training datasets. Our findings show that SLIP enjoys the best of both worlds: better performance than self-supervision (+8.1% linear accuracy) and language supervision (+5.2% zero-shot accuracy). Our code is available at github.com/facebookresearch/SLIP.
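
The abstract describes SLIP as a multi-task objective that pairs CLIP's image-text contrastive loss with a self-supervised contrastive loss over augmented views of the same image. Below is a minimal PyTorch sketch of that combination. The module names (image_encoder, text_encoder, clip_proj, ssl_proj), the ssl_scale weight, and the temperature values are illustrative assumptions rather than the repository's actual API, and the self-supervised term is simplified to use only cross-view negatives rather than a full SimCLR-style NT-Xent over the batch.

import torch
import torch.nn.functional as F

def info_nce(a, b, temperature):
    # Symmetric cross-entropy over the pairwise similarity matrix of two
    # batches of L2-normalized embeddings; matching rows are positives.
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

def slip_loss(image_encoder, text_encoder, clip_proj, ssl_proj,
              clip_view, ssl_view1, ssl_view2, text_tokens,
              ssl_scale=1.0, clip_temp=0.07, ssl_temp=0.1):
    # CLIP branch: contrast image embeddings against caption embeddings.
    img = F.normalize(clip_proj(image_encoder(clip_view)), dim=-1)
    txt = F.normalize(text_encoder(text_tokens), dim=-1)
    clip_loss = info_nce(img, txt, clip_temp)

    # Self-supervised branch: contrast two augmented views of the same
    # image, sharing the image encoder but with a separate projection head.
    z1 = F.normalize(ssl_proj(image_encoder(ssl_view1)), dim=-1)
    z2 = F.normalize(ssl_proj(image_encoder(ssl_view2)), dim=-1)
    ssl_loss = info_nce(z1, z2, ssl_temp)

    # Multi-task objective: the two losses are summed, with a scalar
    # weight (assumed here) on the self-supervised term.
    return clip_loss + ssl_scale * ssl_loss

The key design point this sketch reflects is that a single image encoder is shared across both branches, so the language-supervised and self-supervised objectives shape the same visual representation.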

Cite

Text

Mu et al. "SLIP: Self-Supervision Meets Language-Image Pre-Training." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19809-0_30

Markdown

[Mu et al. "SLIP: Self-Supervision Meets Language-Image Pre-Training." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/mu2022eccv-slip/) doi:10.1007/978-3-031-19809-0_30

BibTeX

@inproceedings{mu2022eccv-slip,
  title     = {{SLIP: Self-Supervision Meets Language-Image Pre-Training}},
  author    = {Mu, Norman and Kirillov, Alexander and Wagner, David and Xie, Saining},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-19809-0_30},
  url       = {https://mlanthology.org/eccv/2022/mu2022eccv-slip/}
}