Masked Autoencoders Are Secretly Efficient Learners

Abstract

This paper provides an efficiency study of training Masked Autoencoders (MAE), a framework introduced by He et al. [13] for pre-training Vision Transformers (ViTs). Our results surprisingly reveal that MAE can learn faster and with fewer training samples while maintaining high performance. To accelerate its training, our changes are simple and straightforward: in the pre-training stage, we aggressively increase the masking ratio, decrease the number of training epochs, and reduce the decoder depth to lower the pre-training cost; in the fine-tuning stage, we demonstrate that layer-wise learning rate decay plays a vital role in unlocking the full potential of pre-trained models. Under this setup, we further verify the sample efficiency of MAE: its performance is hardly affected even when only 20% of the original training set is used. By combining these strategies, we are able to accelerate MAE pre-training by a factor of 82 or more with little performance drop. For example, we can pre-train a ViT-B in ~9 hours on a single NVIDIA A100 GPU and achieve 82.9% top-1 accuracy on the downstream ImageNet classification task. Additionally, we verify the training acceleration on another MAE extension, SupMAE.
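The fine-tuning ingredient the abstract singles out, layer-wise learning rate decay, can be illustrated with a minimal sketch. This is not the authors' implementation: the parameter-name patterns (`cls_token`, `pos_embed`, `patch_embed`, `blocks.N.`) assume a timm-style ViT, and the decay factor 0.65 and base learning rate are illustrative values, not hyperparameters taken from the paper.

```python
# Minimal sketch of layer-wise learning-rate decay for fine-tuning a ViT-B.
# Assumptions: timm-style parameter names; decay=0.65 and base_lr are illustrative.
import torch
import timm


def layer_id(name: str, num_layers: int) -> int:
    """Map a parameter name to a depth index: 0 for the embedding layers,
    i + 1 for transformer block i, and num_layers for everything after
    the last block (e.g. the final norm and classification head)."""
    if name.startswith(("cls_token", "pos_embed", "patch_embed")):
        return 0
    if name.startswith("blocks."):
        return int(name.split(".")[1]) + 1
    return num_layers


def param_groups_lrd(model, base_lr=1e-3, weight_decay=0.05, decay=0.65):
    """Build optimizer parameter groups whose learning rates shrink
    geometrically from the head toward the patch embedding."""
    num_layers = len(model.blocks) + 1
    groups = {}
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        lid = layer_id(name, num_layers)
        scale = decay ** (num_layers - lid)          # deeper layers keep a larger LR
        wd = 0.0 if p.ndim == 1 else weight_decay    # no decay on norms / biases
        key = (lid, wd)
        groups.setdefault(key, {"params": [], "lr": base_lr * scale, "weight_decay": wd})
        groups[key]["params"].append(p)
    return list(groups.values())


model = timm.create_model("vit_base_patch16_224", num_classes=1000)
optimizer = torch.optim.AdamW(param_groups_lrd(model), betas=(0.9, 0.999))
```

Under this grouping, parameters closer to the input receive progressively smaller learning rates, which is the behavior the abstract credits with unlocking the full potential of the pre-trained encoder during fine-tuning.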

Cite

Text

Wei et al. "Masked Autoencoders Are Secretly Efficient Learners." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00797

Markdown

[Wei et al. "Masked Autoencoders Are Secretly Efficient Learners." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/wei2024cvprw-masked/) doi:10.1109/CVPRW63382.2024.00797

BibTeX

@inproceedings{wei2024cvprw-masked,
  title     = {{Masked Autoencoders Are Secretly Efficient Learners}},
  author    = {Wei, Zihao and Wei, Chen and Mei, Jieru and Bai, Yutong and Wang, Zeyu and Li, Xianhang and Zhu, Hongru and Wang, Huiyu and Yuille, Alan L. and Zhou, Yuyin and Xie, Cihang},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2024},
  pages     = {7986--7995},
  doi       = {10.1109/CVPRW63382.2024.00797},
  url       = {https://mlanthology.org/cvprw/2024/wei2024cvprw-masked/}
}