How Transformers Learn Diverse Attention Correlations in Masked Vision Pretraining

Abstract

Masked reconstruction, which predicts randomly masked patches from unmasked ones, has emerged as an important approach in self-supervised pretraining. However, the theoretical understanding of masked pretraining is rather limited, especially for the foundational architecture of transformers. In this paper, to the best of our knowledge, we provide the first end-to-end theoretical guarantee for learning one-layer transformers in masked reconstruction self-supervised pretraining. On the conceptual side, we posit a mechanism for how transformers trained with masked vision pretraining objectives produce the empirically observed **local and diverse** attention patterns, on data distributions with spatial structures that highlight *feature-position correlations*. On the technical side, our end-to-end characterization of the training dynamics of softmax-attention models simultaneously accounts for input and position embeddings, and is developed through a careful analysis that tracks the interplay between feature-wise and position-wise attention correlations.
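
To make the setting concrete, below is a minimal, illustrative sketch of masked-patch reconstruction with a one-layer, single-head softmax-attention model that uses both input (patch) embeddings and additive position embeddings. All dimensions, the mask ratio, and the helper names (`OneLayerMAE`, `masked_recon_loss`) are hypothetical choices for illustration, not the authors' exact architecture or training procedure.

```python
# Minimal sketch (assumed setup, not the paper's exact model): a one-layer,
# single-head softmax-attention network trained to reconstruct randomly
# masked patches from the unmasked ones via an MSE objective.
import torch
import torch.nn as nn
import torch.nn.functional as F


class OneLayerMAE(nn.Module):
    def __init__(self, num_patches=16, patch_dim=48, embed_dim=64):
        super().__init__()
        self.embed = nn.Linear(patch_dim, embed_dim)                   # input (patch) embedding
        self.pos = nn.Parameter(torch.zeros(num_patches, embed_dim))   # learnable position embedding
        self.mask_token = nn.Parameter(torch.zeros(embed_dim))         # placeholder for masked patches
        self.q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.v = nn.Linear(embed_dim, embed_dim, bias=False)
        self.decode = nn.Linear(embed_dim, patch_dim)                  # predict raw patch values

    def forward(self, patches, mask):
        # patches: (B, N, patch_dim); mask: (B, N) bool, True = masked patch
        x = self.embed(patches)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        x = x + self.pos                                               # add position embeddings
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / x.shape[-1] ** 0.5
        attn = F.softmax(scores, dim=-1)                               # softmax attention weights
        return self.decode(attn @ self.v(x)), attn


def masked_recon_loss(model, patches, mask_ratio=0.5):
    # Randomly mask a fraction of patches, then reconstruct only the masked ones.
    B, N, _ = patches.shape
    mask = torch.rand(B, N, device=patches.device) < mask_ratio
    pred, _ = model(patches, mask)
    return F.mse_loss(pred[mask], patches[mask])


# Example usage with random data standing in for image patches.
model = OneLayerMAE()
loss = masked_recon_loss(model, torch.randn(8, 16, 48))
loss.backward()
```

In this sketch, the attention logits depend jointly on the patch embeddings and the additive position embeddings, which is the interplay between feature-wise and position-wise attention correlations that the abstract's analysis tracks; the one-layer, single-head design mirrors the one-layer transformer setting studied in the paper.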

Cite

Text

Huang et al. "How Transformers Learn Diverse Attention Correlations in Masked Vision Pretraining." ICML 2024 Workshops: TF2M, 2024.

Markdown

[Huang et al. "How Transformers Learn Diverse Attention Correlations in Masked Vision Pretraining." ICML 2024 Workshops: TF2M, 2024.](https://mlanthology.org/icmlw/2024/huang2024icmlw-transformers/)

BibTeX

@inproceedings{huang2024icmlw-transformers,
  title     = {{How Transformers Learn Diverse Attention Correlations in Masked Vision Pretraining}},
  author    = {Huang, Yu and Wen, Zixin and Chi, Yuejie and Liang, Yingbin},
  booktitle = {ICML 2024 Workshops: TF2M},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/huang2024icmlw-transformers/}
}