Masked Siamese Networks for Label-Efficient Learning

Abstract

We propose Masked Siamese Networks (MSN), a self-supervised learning framework for learning image representations. Our approach matches the representation of an image view containing randomly masked patches to the representation of the original unmasked image. This self-supervised pre-training strategy is particularly scalable when applied to Vision Transformers, since only the unmasked patches are processed by the network. As a result, MSNs improve the scalability of joint-embedding architectures while producing representations of a high semantic level that perform competitively on low-shot image classification. For instance, on ImageNet-1K with only 5,000 annotated images, our large MSN model achieves 72.1% top-1 accuracy, and with 1% of ImageNet-1K labels, we achieve 75.1% top-1 accuracy, setting a new state-of-the-art for self-supervised learning on this benchmark. Our code is publicly available at https://github.com/facebookresearch/msn.
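
To make the matching objective concrete, below is a minimal, self-contained PyTorch sketch of the masking-and-matching idea described in the abstract; it is not the authors' released implementation. The toy encoder, the prototype count, the temperatures, and the use of a plain stop-gradient in place of a momentum (EMA) target encoder are all illustrative assumptions.

```python
# Minimal sketch of the MSN idea -- NOT the authors' implementation.
# Assumptions: a toy MLP stands in for a real ViT backbone; dimensions,
# prototype count, and temperatures are illustrative only.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

num_patches, dim, num_prototypes = 196, 64, 128
encoder = torch.nn.Sequential(           # stand-in for a ViT encoder
    torch.nn.Linear(dim, dim), torch.nn.GELU(), torch.nn.Linear(dim, dim)
)
prototypes = torch.nn.Parameter(torch.randn(num_prototypes, dim))

def represent(patches):
    """Encode a bag of patch tokens and mean-pool to one representation."""
    return encoder(patches).mean(dim=1)

def assignments(z, temperature):
    """Soft assignment of a representation to the learnable prototypes."""
    logits = F.normalize(z, dim=-1) @ F.normalize(prototypes, dim=-1).T
    return F.softmax(logits / temperature, dim=-1)

# One training step on a batch of 8 images, each a bag of patch tokens.
patches = torch.randn(8, num_patches, dim)

# Random masking: keep a subset of patches for the anchor view, so the
# encoder only processes the unmasked tokens (the source of MSN's
# scalability). A single mask is shared across the batch for brevity.
keep = torch.randperm(num_patches)[: num_patches // 2]
anchor_view = patches[:, keep, :]

# Target: the full, unmasked image, with gradients stopped (the paper
# uses an EMA target encoder; plain stop-gradient here for brevity).
with torch.no_grad():
    target_assign = assignments(represent(patches), temperature=0.1)

anchor_assign = assignments(represent(anchor_view), temperature=0.25)

# Match the masked view's prototype assignments to the unmasked view's
# via a soft cross-entropy.
loss = -(target_assign * torch.log(anchor_assign + 1e-8)).sum(dim=-1).mean()
loss.backward()
print(f"loss = {loss.item():.4f}")
```

Note that the speedup claimed in the abstract comes from the anchor branch: the encoder sees only the kept tokens, so compute drops roughly in proportion to the masking ratio.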

Cite

Text

Assran et al. "Masked Siamese Networks for Label-Efficient Learning." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19821-2_26

Markdown

[Assran et al. "Masked Siamese Networks for Label-Efficient Learning." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/assran2022eccv-masked/) doi:10.1007/978-3-031-19821-2_26

BibTeX

@inproceedings{assran2022eccv-masked,
  title     = {{Masked Siamese Networks for Label-Efficient Learning}},
  author    = {Assran, Mahmoud and Caron, Mathilde and Misra, Ishan and Bojanowski, Piotr and Bordes, Florian and Vincent, Pascal and Joulin, Armand and Rabbat, Michael and Ballas, Nicolas},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-19821-2_26},
  url       = {https://mlanthology.org/eccv/2022/assran2022eccv-masked/}
}