Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

Assran, Mahmoud; Duval, Quentin; Misra, Ishan; Bojanowski, Piotr; Vincent, Pascal; Rabbat, Michael; LeCun, Yann; Ballas, Nicolas

doi:10.1109/CVPR52729.2023.01499

CVPR 2023 pp. 15619-15629

Abstract

This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.
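The masking strategy described in the abstract can be illustrated with a minimal sketch: sample several fairly large target blocks on the patch grid, then sample one large context block and remove any patches that overlap a target. This is not the authors' implementation. The function names, the 14x14 grid, the number of targets, the scale and aspect-ratio ranges, and the overlap-removal step are all assumptions chosen for illustration, since the abstract gives no specific values.

```python
import math
import random


def sample_block(grid, scale_range, aspect_range, rng):
    """Return the set of patch indices of one random rectangle on a grid x grid layout."""
    scale = rng.uniform(*scale_range)      # fraction of all patches the block should cover
    aspect = rng.uniform(*aspect_range)    # height / width ratio of the block
    area = scale * grid * grid
    # Convert the requested area and aspect ratio to integer height and width.
    h = max(1, min(grid, round(math.sqrt(area * aspect))))
    w = max(1, min(grid, round(math.sqrt(area / aspect))))
    top = rng.randint(0, grid - h)
    left = rng.randint(0, grid - w)
    return {r * grid + c for r in range(top, top + h) for c in range(left, left + w)}


def sample_masks(grid=14, num_targets=4, seed=0):
    """Sample one context block and several target blocks (hypothetical helper).

    All numeric ranges below are illustrative assumptions, not values
    taken from the abstract.
    """
    rng = random.Random(seed)
    # (a) Target blocks: large enough to cover semantic parts of the image.
    targets = [
        sample_block(grid, (0.15, 0.2), (0.75, 1.5), rng)
        for _ in range(num_targets)
    ]
    # (b) Context block: large and spatially spread out, so it is informative.
    context = sample_block(grid, (0.85, 1.0), (1.0, 1.0), rng)
    # Assumed step: drop target patches from the context so that the
    # prediction task cannot be solved by copying visible patches.
    for t in targets:
        context -= t
    return context, targets


if __name__ == "__main__":
    context, targets = sample_masks()
    print("context patches:", len(context))
    print("target block sizes:", [len(t) for t in targets])
```

In a full training loop, a context encoder would see only the `context` patches, and a predictor would regress the representations of each target block as computed by a separate target encoder. Those components are outside the scope of this sketch.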

Cite

Text

Assran et al. "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01499

Markdown

[Assran et al. "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/assran2023cvpr-selfsupervised/) doi:10.1109/CVPR52729.2023.01499

BibTeX

@inproceedings{assran2023cvpr-selfsupervised,
  title     = {{Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture}},
  author    = {Assran, Mahmoud and Duval, Quentin and Misra, Ishan and Bojanowski, Piotr and Vincent, Pascal and Rabbat, Michael and LeCun, Yann and Ballas, Nicolas},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {15619-15629},
  doi       = {10.1109/CVPR52729.2023.01499},
  url       = {https://mlanthology.org/cvpr/2023/assran2023cvpr-selfsupervised/}
}