On the Surprising Effectiveness of Attention Transfer for Vision Transformers

Abstract

Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations. Is this actually true? We investigate this question and find that the features and representations learned during pre-training are not essential. Surprisingly, using only the attention patterns from pre-training (i.e., guiding how information flows between tokens) is sufficient for models to learn high-quality features from scratch and achieve comparable downstream performance. We show this by introducing a simple method called attention transfer, where only the attention patterns from a pre-trained teacher ViT are transferred to a student, either by copying or distilling the attention maps. Since attention transfer lets the student learn its own features, ensembling it with a fine-tuned teacher further improves accuracy on ImageNet. We systematically study various aspects of our findings on the sufficiency of attention maps, including distribution shift settings where they underperform fine-tuning. We hope our exploration provides a better understanding of what pre-training accomplishes and leads to a useful alternative to the standard practice of fine-tuning.
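
To make the distillation variant of attention transfer concrete, below is a minimal PyTorch sketch of one plausible reading of the abstract: the student is trained from scratch while its per-layer attention maps are pulled toward a frozen teacher's. The row-wise KL divergence, the layer-by-layer matching, and the return_attention=True interface are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn.functional as F

def attention_transfer_loss(student_attns, teacher_attns, eps=1e-8):
    """Match a student's attention maps to a frozen teacher's, layer by layer.

    Both arguments are lists of tensors shaped (batch, heads, tokens, tokens),
    where each row is a softmax distribution over key tokens. We penalize the
    row-wise KL divergence KL(teacher || student); the exact divergence and
    layer weighting in the paper may differ -- treat this as a sketch.
    """
    assert len(student_attns) == len(teacher_attns)
    total = 0.0
    for s_attn, t_attn in zip(student_attns, teacher_attns):
        b, h, n, _ = s_attn.shape
        # Flatten to one attention row per (image, head, query token).
        s_rows = s_attn.reshape(b * h * n, n).clamp_min(eps).log()
        t_rows = t_attn.reshape(b * h * n, n)
        total = total + F.kl_div(s_rows, t_rows, reduction="batchmean")
    return total / len(student_attns)

# Hypothetical usage: `teacher` is a frozen pre-trained ViT and `student` is a
# randomly initialized ViT of matching depth; both are assumed to expose their
# per-layer attention maps alongside the logits.
# logits, s_attns = student(images, return_attention=True)
# with torch.no_grad():
#     _, t_attns = teacher(images, return_attention=True)
# loss = F.cross_entropy(logits, labels) + attention_transfer_loss(s_attns, t_attns)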

Cite

Text

Li et al. "On the Surprising Effectiveness of Attention Transfer for Vision Transformers." Neural Information Processing Systems, 2024. doi:10.52202/079017-3619

Markdown

[Li et al. "On the Surprising Effectiveness of Attention Transfer for Vision Transformers." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/li2024neurips-surprising/) doi:10.52202/079017-3619

BibTeX

@inproceedings{li2024neurips-surprising,
  title     = {{On the Surprising Effectiveness of Attention Transfer for Vision Transformers}},
  author    = {Li, Alexander C. and Tian, Yuandong and Chen, Beidi and Pathak, Deepak and Chen, Xinlei},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-3619},
  url       = {https://mlanthology.org/neurips/2024/li2024neurips-surprising/}
}