Minimax Rates for Learning Pairwise Interactions in Attention-Style Models

Abstract

We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a nonlinear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$, where $M$ is the sample size and $\beta$ is the Hölder smoothness of the activation function. Importantly, this rate is independent of the embedding dimension $d$, the number of tokens $N$, and the rank $r$ of the weight matrix, provided that $rd \le (M/\log M)^{\frac{1}{2\beta+1}}$. These results highlight a fundamental statistical efficiency of attention-style models, which holds even when the weight matrix and the activation are not separately identifiable, and they offer both a theoretical understanding of attention mechanisms and guidance on training.
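As a rough numerical illustration of the two formulas in the abstract, the sketch below evaluates the rate $M^{-\frac{2\beta}{2\beta+1}}$ and the threshold $(M/\log M)^{\frac{1}{2\beta+1}}$ for given values. The function names are hypothetical and are not taken from the paper. The natural logarithm is an assumption, since the abstract does not state the base, and constants hidden in the rate are ignored.

```python
import math


def minimax_rate(M: int, beta: float) -> float:
    """Rate M^(-2*beta / (2*beta + 1)) from the abstract, ignoring constants."""
    return M ** (-2.0 * beta / (2.0 * beta + 1.0))


def dimension_budget(M: int, beta: float) -> float:
    """Threshold (M / log M)^(1 / (2*beta + 1)) that the product r*d must not exceed.

    Assumption: log is the natural logarithm; the abstract does not state the base.
    """
    return (M / math.log(M)) ** (1.0 / (2.0 * beta + 1.0))


def condition_holds(M: int, beta: float, r: int, d: int) -> bool:
    """Check the abstract's condition r*d <= (M / log M)^(1 / (2*beta + 1))."""
    return r * d <= dimension_budget(M, beta)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    M, beta = 10**6, 1.0
    print("rate:", minimax_rate(M, beta))
    print("budget on r*d:", dimension_budget(M, beta))
    print("r=1, d=2 satisfies condition:", condition_holds(M, beta, r=1, d=2))
```

When the condition holds, the abstract's claim is that the rate depends only on $M$ and $\beta$, which is why `minimax_rate` takes no argument for $d$, $N$, or $r$.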

Cite

Text

Zucker et al. "Minimax Rates for Learning Pairwise Interactions in Attention-Style Models." International Conference on Learning Representations, 2026.

Markdown

[Zucker et al. "Minimax Rates for Learning Pairwise Interactions in Attention-Style Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zucker2026iclr-minimax/)

BibTeX

@inproceedings{zucker2026iclr-minimax,
  title     = {{Minimax Rates for Learning Pairwise Interactions in Attention-Style Models}},
  author    = {Zucker, Shai and Wang, Xiong and Lu, Fei and Seroussi, Inbar},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/zucker2026iclr-minimax/}
}