Exploring Target Representations for Masked Autoencoders
Abstract
Masked autoencoders have become popular training paradigms for self-supervised visual representation learning. These models randomly mask a portion of the input and reconstruct the masked portion according to assigned target representations. In this paper, we show that a careful choice of the target representation is unnecessary for learning good visual representations, since different targets tend to derive similarly behaved models. Driven by this observation, we propose a multi-stage masked distillation pipeline and use a randomly initialized model as the teacher, enabling us to effectively train high-capacity models without any effort to carefully design the target representation. On various downstream tasks, the proposed method to perform masked knowledge distillation with bootstrapped teachers (dBOT) outperforms previous self-supervised methods by nontrivial margins. We hope our findings, as well as the proposed method, could motivate people to rethink the roles of target representations in pre-training masked autoencoders.
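The pipeline sketched in the abstract can be illustrated with a deliberately tiny NumPy toy: a linear "encoder" stands in for the vision transformer, patches are randomly masked, the student regresses the teacher's features on the masked patches, and between stages the teacher is replaced by the previous student (bootstrapping) while the student is re-initialized. All dimensions, learning rates, and the linear model itself are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_FEAT, N_PATCH = 16, 8, 32  # patch dim, feature dim, patches per image


def init_model():
    # A single linear map standing in for an encoder (illustrative only).
    return rng.normal(scale=0.1, size=(D_IN, D_FEAT))


def encode(W, x):
    return x @ W


def masked_distill_step(student, teacher, x, mask_ratio=0.75, lr=0.1):
    # Randomly choose which patches are masked this step.
    n_mask = int(mask_ratio * N_PATCH)
    idx = rng.choice(N_PATCH, size=n_mask, replace=False)
    target = encode(teacher, x[idx])  # teacher features are the targets
    pred = encode(student, x[idx])    # student predicts them on masked patches
    # Gradient of 0.5 * ||pred - target||^2 w.r.t. the student weights.
    grad = x[idx].T @ (pred - target) / n_mask
    loss = float(np.mean((pred - target) ** 2))
    return student - lr * grad, loss


# Multi-stage pipeline: the first teacher is randomly initialized; after each
# stage the teacher is bootstrapped from the trained student, and the student
# is re-initialized before distilling again.
x = rng.normal(size=(N_PATCH, D_IN))
teacher = init_model()
losses = []
for stage in range(3):
    student = init_model()  # fresh student each stage
    for _ in range(1000):
        student, loss = masked_distill_step(student, teacher, x)
        losses.append(loss)
    teacher = student.copy()  # bootstrap the teacher for the next stage
```

Because the toy target is exactly realizable (the teacher is itself linear), the distillation loss within each stage drops toward zero; the point of the sketch is only the control flow of masking, distilling, and bootstrapping, not any claim about learned representation quality.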
Cite
Text
Liu et al. "Exploring Target Representations for Masked Autoencoders." International Conference on Learning Representations, 2024.
Markdown
[Liu et al. "Exploring Target Representations for Masked Autoencoders." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/liu2024iclr-exploring/)
BibTeX
@inproceedings{liu2024iclr-exploring,
  title = {{Exploring Target Representations for Masked Autoencoders}},
  author = {Liu, Xingbin and Zhou, Jinghao and Kong, Tao and Lin, Xianming and Ji, Rongrong},
  booktitle = {International Conference on Learning Representations},
  year = {2024},
  url = {https://mlanthology.org/iclr/2024/liu2024iclr-exploring/}
}