MOCA: Self-Supervised Representation Learning by Predicting Masked Online Codebook Assignments
Abstract
Self-supervised learning can be used to mitigate the need of Vision Transformer networks for very large, fully annotated datasets. Different classes of self-supervised learning offer representations with either good contextual reasoning properties, e.g., using masked image modeling strategies, or invariance to image perturbations, e.g., with contrastive methods. In this work, we propose a single-stage and standalone method, MOCA, which unifies both desired properties using novel mask-and-predict objectives defined with high-level features (instead of pixel-level details). Moreover, we show how to effectively employ both learning paradigms in a synergistic and computation-efficient way. In doing so, we achieve new state-of-the-art results in low-shot settings and strong experimental results in various evaluation protocols, with training that is at least 3 times faster than prior methods. We provide the implementation code at https://github.com/valeoai/MOCA.
Cite
Text
Gidaris et al. "MOCA: Self-Supervised Representation Learning by Predicting Masked Online Codebook Assignments." Transactions on Machine Learning Research, 2024.
Markdown
[Gidaris et al. "MOCA: Self-Supervised Representation Learning by Predicting Masked Online Codebook Assignments." Transactions on Machine Learning Research, 2024.](https://mlanthology.org/tmlr/2024/gidaris2024tmlr-moca/)
BibTeX
@article{gidaris2024tmlr-moca,
title = {{MOCA: Self-Supervised Representation Learning by Predicting Masked Online Codebook Assignments}},
author = {Gidaris, Spyros and Bursuc, Andrei and Siméoni, Oriane and Vobecký, Antonín and Komodakis, Nikos and Cord, Matthieu and Pérez, Patrick},
journal = {Transactions on Machine Learning Research},
year = {2024},
url = {https://mlanthology.org/tmlr/2024/gidaris2024tmlr-moca/}
}