Rethinking Patch Dependence for Masked Autoencoders
Abstract
In this work, we examine the impact of inter-patch dependencies in the decoder of masked autoencoders (MAE) on representation learning. We decompose the decoding mechanism for masked reconstruction into self-attention between mask tokens and cross-attention between masked and visible tokens. Our findings reveal that MAE reconstructs coherent images from visible patches not through interactions between patches in the decoder but by learning a global representation within the encoder. This discovery leads us to propose a simple visual pretraining framework: cross-attention masked autoencoders (CrossMAE). This framework employs only cross-attention in the decoder to independently read out reconstructions for a small subset of masked patches from encoder outputs. This approach achieves comparable or superior performance to traditional MAE across models ranging from ViT-S to ViT-H and significantly reduces computational requirements. By its design, CrossMAE challenges the necessity of interaction between mask tokens for effective masked pretraining. Code and models are publicly available: https://crossmae.github.io/
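The decoder described above can be sketched as a single cross-attention block in which mask-token queries attend only to encoder outputs, with no self-attention among mask tokens. The following is a minimal illustrative sketch in PyTorch; module names, dimensions, and hyperparameters are assumptions for exposition and are not taken from the official CrossMAE codebase.

```python
# Illustrative cross-attention-only decoder block: queries are mask tokens,
# keys/values are visible-token features from the encoder, so each masked
# patch is decoded independently of the other masked patches.
import torch
import torch.nn as nn


class CrossAttentionDecoderBlock(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 16, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        # Queries come from mask tokens; keys/values come from visible-token features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, mask_tokens: torch.Tensor, visible_feats: torch.Tensor) -> torch.Tensor:
        # mask_tokens:   (B, N_masked_subset, dim) -- queries for the small subset
        #                of masked patches selected for reconstruction.
        # visible_feats: (B, N_visible, dim)       -- encoder outputs.
        q = self.norm_q(mask_tokens)
        kv = self.norm_kv(visible_feats)
        attn_out, _ = self.cross_attn(q, kv, kv, need_weights=False)
        x = mask_tokens + attn_out          # residual connection
        x = x + self.mlp(self.norm_mlp(x))  # feed-forward with residual
        return x


# Example usage with arbitrary shapes (batch of 2, 49 masked queries, 49 visible tokens).
if __name__ == "__main__":
    block = CrossAttentionDecoderBlock(dim=512, num_heads=16)
    mask_tokens = torch.randn(2, 49, 512)
    visible_feats = torch.randn(2, 49, 512)
    out = block(mask_tokens, visible_feats)
    print(out.shape)  # torch.Size([2, 49, 512])
```

Because the queries never attend to one another, only the selected subset of masked patches needs to pass through the decoder, which is the source of the computational savings noted in the abstract.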
Cite
Text
Fu et al. "Rethinking Patch Dependence for Masked Autoencoders." Transactions on Machine Learning Research, 2025.
Markdown
[Fu et al. "Rethinking Patch Dependence for Masked Autoencoders." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/fu2025tmlr-rethinking/)
BibTeX
@article{fu2025tmlr-rethinking,
title = {{Rethinking Patch Dependence for Masked Autoencoders}},
author = {Fu, Letian and Lian, Long and Wang, Renhao and Shi, Baifeng and Wang, XuDong and Yala, Adam and Darrell, Trevor and Efros, Alexei A. and Goldberg, Ken},
journal = {Transactions on Machine Learning Research},
year = {2025},
url = {https://mlanthology.org/tmlr/2025/fu2025tmlr-rethinking/}
}