Beyond [cls]: Exploring the True Potential of Masked Image Modeling Representations

Abstract

Masked Image Modeling (MIM) has emerged as a promising approach for Self-Supervised Learning (SSL) of visual representations. However, the out-of-the-box performance of MIMs is typically inferior to that of competing approaches. Most users cannot afford fine-tuning, as it requires large amounts of data, high GPU consumption, and specialized expertise. Therefore, the practical use of MIM representations is limited. In this paper, we ask why MIMs perform poorly out of the box: is it because MIM models produce weaker features, or because those features are used suboptimally? Through detailed analysis, we show that attention in MIMs is spread almost uniformly over many patches, leading to ineffective aggregation by the [cls] token. Based on this insight, we propose Selective aggregation to better capture the rich semantic information retained in patch tokens, which significantly improves the out-of-the-box performance of MIMs.
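
The core idea stated in the abstract, reading out a representation by aggregating patch tokens instead of relying on the [cls] token, can be sketched as follows. This is not the authors' implementation: the learned softmax weighting ("AttentivePatchPooling") and all names below are illustrative assumptions that merely contrast the two readout strategies.

# Minimal sketch, assuming a ViT-style encoder output of shape (batch, 1 + num_patches, dim).
import torch
import torch.nn as nn


class AttentivePatchPooling(nn.Module):
    """Hypothetical aggregation: learned softmax weights over patch tokens."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one scalar relevance score per patch

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim)
        weights = self.score(patch_tokens).softmax(dim=1)   # (B, N, 1)
        return (weights * patch_tokens).sum(dim=1)          # (B, dim)


if __name__ == "__main__":
    B, N, D = 2, 196, 768                  # ViT-B/16-like token layout
    tokens = torch.randn(B, N + 1, D)      # stand-in for MIM encoder output ([cls] + patches)

    cls_readout = tokens[:, 0]                                 # conventional [cls] readout
    patch_readout = AttentivePatchPooling(D)(tokens[:, 1:])    # aggregate patch tokens instead

    print(cls_readout.shape, patch_readout.shape)  # both (2, 768)

Either readout can then be fed to a linear probe; the paper's point is that, for MIM encoders, a readout built from the patch tokens is a better target for such probing than the [cls] token.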

Cite

Text

Przewięźlikowski et al. "Beyond [cls]: Exploring the True Potential of Masked Image Modeling Representations." International Conference on Computer Vision, 2025.

Markdown

[Przewięźlikowski et al. "Beyond [cls]: Exploring the True Potential of Masked Image Modeling Representations." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/przewiezlikowski2025iccv-beyond/)

BibTeX

@inproceedings{przewiezlikowski2025iccv-beyond,
  title     = {{Beyond [cls]: Exploring the True Potential of Masked Image Modeling Representations}},
  author    = {Przewięźlikowski, Marcin and Balestriero, Randall and Jasiński, Wojciech and Śmieja, Marek and Zieliński, Bartosz},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {23442--23452},
  url       = {https://mlanthology.org/iccv/2025/przewiezlikowski2025iccv-beyond/}
}