How Smooth Is Attention?

Abstract

Self-attention and masked self-attention are at the heart of Transformers' outstanding success. Still, our mathematical understanding of attention, in particular of its Lipschitz properties, which are key to analyzing robustness and expressive power, is incomplete. We provide a detailed study of the Lipschitz constant of self-attention in several practical scenarios, discussing the impact of the sequence length $n$ and layer normalization on the local Lipschitz constant of both unmasked and masked self-attention. In particular, we show that for inputs of length $n$ in any compact set, the Lipschitz constant of self-attention is bounded by $\sqrt{n}$ up to a constant factor and that this bound is tight for reasonable sequence lengths. When the sequence length $n$ is too large for the previous bound to be tight, which we refer to as the mean-field regime, we provide an upper bound and a matching lower bound which are independent of $n$. Our mean-field framework for masked self-attention is novel and of independent interest. Our experiments on pretrained and randomly initialized BERT and GPT-2 support our theoretical findings.
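
For a concrete handle on the quantity studied in the abstract, the following Python (PyTorch) sketch numerically estimates the local Lipschitz constant of a single unmasked self-attention head as the spectral norm of its Jacobian at random inputs of increasing length $n$. This is an illustration only: the head dimension, the Gaussian weights, and the sequence lengths are arbitrary assumptions, not the paper's experimental protocol.

```python
# A minimal numerical sketch, not the paper's setup: probe the local Lipschitz
# constant of one unmasked self-attention head at random inputs by computing
# the spectral norm of its Jacobian. Dimensions and weights are arbitrary.
import torch

torch.manual_seed(0)
d = 8  # embedding dimension (arbitrary for this example)

# Random query, key and value matrices, scaled so entries are O(1/sqrt(d))
W_Q = torch.randn(d, d) / d**0.5
W_K = torch.randn(d, d) / d**0.5
W_V = torch.randn(d, d) / d**0.5

def self_attention(X: torch.Tensor) -> torch.Tensor:
    """Single-head unmasked self-attention, mapping (n, d) to (n, d)."""
    scores = (X @ W_Q) @ (X @ W_K).T / d**0.5
    A = torch.softmax(scores, dim=-1)
    return A @ (X @ W_V)

for n in (8, 32, 128):
    X = torch.randn(n, d)  # a random point in a compact set of inputs
    # Jacobian of the flattened map R^{n d} -> R^{n d}; its spectral norm is
    # the local Lipschitz constant of self-attention at X.
    J = torch.autograd.functional.jacobian(self_attention, X).reshape(n * d, n * d)
    local_lip = torch.linalg.matrix_norm(J, ord=2).item()
    print(f"n = {n:4d}  local Lipschitz estimate = {local_lip:.3f}")
```

Repeating this estimate over many inputs drawn from a fixed compact set yields an empirical lower bound on the Lipschitz constant over that set, which can then be compared with the $\sqrt{n}$ scaling discussed in the abstract.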

Cite

Text

Castin et al. "How Smooth Is Attention?" International Conference on Machine Learning, 2024.

Markdown

[Castin et al. "How Smooth Is Attention?" International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/castin2024icml-smooth/)

BibTeX

@inproceedings{castin2024icml-smooth,
  title     = {{How Smooth Is Attention?}},
  author    = {Castin, Valérie and Ablin, Pierre and Peyré, Gabriel},
  booktitle = {International Conference on Machine Learning},
  year      = {2024},
  pages     = {5817--5840},
  volume    = {235},
  url       = {https://mlanthology.org/icml/2024/castin2024icml-smooth/}
}