Understanding Masked Image Modeling via Learning Occlusion Invariant Feature

CVPR 2023 pp. 6241-6251

doi:10.1109/CVPR52729.2023.00604

Abstract

Recently, Masked Image Modeling (MIM) has achieved great success in self-supervised visual recognition. However, because it is a reconstruction-based framework, how MIM works remains an open question: it appears very different from previously well-studied siamese approaches such as contrastive learning. In this paper, we propose a new viewpoint: MIM implicitly learns occlusion-invariant features, which is analogous to other siamese methods, except that the latter learn other invariances. By relaxing the MIM formulation into an equivalent siamese form, MIM methods can be interpreted in a unified framework with conventional methods, in which only a) the data transformations, i.e. what invariance to learn, and b) the similarity measurements differ. Furthermore, taking MAE (He et al., 2021) as a representative example of MIM, we empirically find that the success of MIM models relates little to the choice of similarity function and instead stems from the occlusion-invariant feature learned from masked images; this feature turns out to be a favored initialization for vision transformers, even though it could be less semantic. We hope our findings will inspire researchers to develop more powerful self-supervised methods in the computer vision community.
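The siamese reading described in the abstract can be illustrated with a small sketch. This is not the paper's formulation or code: it only shows the two knobs the abstract names, a data transformation (here random patch occlusion) and a pluggable similarity measurement. The names `mask_patches`, `toy_encoder`, and `siamese_occlusion_loss` are hypothetical, the mean-pooling encoder is a stand-in for a real network, and comparing the occluded view against the full image is an assumption made for illustration.

```python
import math
import random


def mask_patches(num_patches, mask_ratio, rng):
    """Randomly split patch indices into (visible, masked). Assumes mask_ratio < 1."""
    idx = list(range(num_patches))
    rng.shuffle(idx)
    n_masked = int(num_patches * mask_ratio)
    return sorted(idx[n_masked:]), sorted(idx[:n_masked])


def toy_encoder(patches, keep):
    """Hypothetical stand-in encoder: mean-pool the kept patch vectors."""
    keep = list(keep)
    dim = len(patches[0])
    return [sum(patches[i][d] for i in keep) / len(keep) for d in range(dim)]


def l2_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus cosine similarity, a common contrastive-style measurement."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def siamese_occlusion_loss(patches, mask_ratio, distance, rng):
    """Distance between features of an occluded view and of the full image.

    A small value means the encoder's output changes little under occlusion,
    i.e. the feature is (approximately) occlusion invariant.
    """
    visible, _masked = mask_patches(len(patches), mask_ratio, rng)
    z_occluded = toy_encoder(patches, visible)          # branch 1: masked input
    z_full = toy_encoder(patches, range(len(patches)))  # branch 2: full input
    return distance(z_occluded, z_full)


if __name__ == "__main__":
    rng = random.Random(0)
    # Eight toy "patches", each a 2-dimensional vector.
    patches = [[float(i), float(2 * i)] for i in range(1, 9)]
    for name, dist in (("l2", l2_distance), ("cosine", cosine_distance)):
        loss = siamese_occlusion_loss(patches, 0.75, dist, rng)
        print(name, loss)
```

Swapping `l2_distance` for `cosine_distance` changes only the similarity measurement while the occlusion transformation stays fixed, which mirrors the abstract's claim that these are the two components that distinguish methods within the unified view.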

Cite

Text

Kong and Zhang. "Understanding Masked Image Modeling via Learning Occlusion Invariant Feature." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.00604

Markdown

[Kong and Zhang. "Understanding Masked Image Modeling via Learning Occlusion Invariant Feature." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/kong2023cvpr-understanding-a/) doi:10.1109/CVPR52729.2023.00604

BibTeX

@inproceedings{kong2023cvpr-understanding-a,
  title     = {{Understanding Masked Image Modeling via Learning Occlusion Invariant Feature}},
  author    = {Kong, Xiangwen and Zhang, Xiangyu},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {6241-6251},
  doi       = {10.1109/CVPR52729.2023.00604},
  url       = {https://mlanthology.org/cvpr/2023/kong2023cvpr-understanding-a/}
}