Attention-Guided Masked Autoencoders for Learning Image Representations
Abstract
Masked autoencoders (MAEs) have established themselves as a powerful pre-training method for computer vision tasks. While vanilla MAEs put equal emphasis on reconstructing the individual parts of the image, we propose to inform the reconstruction process through an attention-guided loss function. By leveraging advances in unsupervised object discovery, we obtain an attention map of the scene, which we employ in the loss function to put increased emphasis on reconstructing relevant objects. Thus, we incentivize the model to learn improved representations of the scene for a variety of tasks. Our evaluations show that our pre-trained models produce off-the-shelf representations that are more effective than the vanilla MAE for such tasks, as demonstrated by improved linear probing and k-NN classification results on several benchmarks, while at the same time making ViTs more robust against varying backgrounds and changes in texture.
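The core idea of the abstract, weighting the MAE reconstruction loss by a per-patch attention map, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name, the plain-list tensor layout, and the normalization scheme (weights averaging to 1 over masked patches so the loss scale matches the unweighted version) are all assumptions.

```python
def attention_weighted_mae_loss(pred, target, attn, mask):
    """Hypothetical sketch of an attention-guided MAE loss for one image.

    pred, target: N patch vectors (lists of floats) - predicted / true pixels
    attn:         N non-negative attention scores (e.g. from an
                  unsupervised object-discovery method)
    mask:         N ints, 1 = masked patch (to reconstruct), 0 = visible
    """
    n_masked = sum(mask)
    # Per-patch mean squared error, as in the vanilla MAE.
    per_patch = [sum((p - t) ** 2 for p, t in zip(pp, tt)) / len(pp)
                 for pp, tt in zip(pred, target)]
    # Normalize attention over masked patches so the weights average to 1,
    # keeping the loss scale comparable to the unweighted version
    # (assumed normalization, not taken from the paper).
    total_attn = sum(a * m for a, m in zip(attn, mask))
    weights = [a * m * n_masked / total_attn for a, m in zip(attn, mask)]
    # Average weighted error over masked patches only.
    return sum(e * w for e, w in zip(per_patch, weights)) / n_masked
```

With a uniform attention map the weights all become 1 and the loss reduces to the vanilla masked MSE; a non-uniform map shifts the loss toward patches the attention marks as relevant.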
Cite
Text
Sick et al. "Attention-Guided Masked Autoencoders for Learning Image Representations." Winter Conference on Applications of Computer Vision, 2025.

Markdown
[Sick et al. "Attention-Guided Masked Autoencoders for Learning Image Representations." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/sick2025wacv-attentionguided/)

BibTeX
@inproceedings{sick2025wacv-attentionguided,
  title = {{Attention-Guided Masked Autoencoders for Learning Image Representations}},
  author = {Sick, Leon and Engel, Dominik and Hermosilla, Pedro and Ropinski, Timo},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year = {2025},
  pages = {836-846},
  url = {https://mlanthology.org/wacv/2025/sick2025wacv-attentionguided/}
}