Critical Attention Scaling in Long-Context Transformers

Abstract

As large language models scale to longer contexts, attention layers suffer from a fundamental pathology: attention scores collapse toward uniformity as context length $n$ increases, causing tokens to cluster excessively, a phenomenon known as rank-collapse. While attention scaling effectively addresses this deficiency by rescaling attention scores with a polylogarithmic factor $\beta_n$, theoretical justification for this approach remains lacking. We analyze a simplified yet tractable model that magnifies the effect of attention scaling. In this model, attention exhibits a phase transition governed by the scaling factor $\beta_n$: insufficient scaling collapses all tokens to a single direction, while excessive scaling reduces attention to identity, thereby eliminating meaningful interactions between tokens. Our main result identifies the critical scaling $\beta_n \asymp \log n$ and provides a rigorous justification for attention scaling in YaRN and Qwen, clarifying why logarithmic scaling maintains sparse, content-adaptive attention at large context lengths.
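The two regimes described above can be illustrated with a small numerical sketch. This is not the paper's model: it assumes tokens are random unit vectors, uses identity query and key maps, and inspects a single attention matrix rather than the token dynamics. The helper names (`random_unit_vectors`, `attention_row`, `mean_self_weight`) are hypothetical. The quantity tracked is the average attention weight a token places on itself: close to $1/n$ when attention is nearly uniform, and close to $1$ when attention is nearly the identity.

```python
import math
import random


def random_unit_vectors(n, d, seed=0):
    """Draw n random unit vectors in R^d (an assumed stand-in for tokens)."""
    rng = random.Random(seed)
    vecs = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        vecs.append([x / norm for x in v])
    return vecs


def attention_row(i, tokens, beta):
    """Softmax over j of beta * <x_i, x_j>, computed stably."""
    scores = [beta * sum(a * b for a, b in zip(tokens[i], t)) for t in tokens]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def mean_self_weight(tokens, beta):
    """Average diagonal entry of the attention matrix."""
    n = len(tokens)
    return sum(attention_row(i, tokens, beta)[i] for i in range(n)) / n


n, d = 128, 16  # assumed sizes, chosen only to keep the sketch fast
tokens = random_unit_vectors(n, d)

for label, beta in [
    ("beta = 1 (no scaling)", 1.0),
    ("beta = log n", math.log(n)),
    ("beta = 1000 (excessive)", 1000.0),
]:
    print(f"{label:26s} mean self-weight = {mean_self_weight(tokens, beta):.4f}")
```

Because each token's score with itself is the largest entry in its row, the self-weight can only grow as $\beta$ increases, so the sketch interpolates monotonically between the uniform and identity extremes. It says nothing about where the transition is sharp; that is the content of the paper's result on $\beta_n \asymp \log n$.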

Cite

Text

Chen et al. "Critical Attention Scaling in Long-Context Transformers." International Conference on Learning Representations, 2026.

Markdown

[Chen et al. "Critical Attention Scaling in Long-Context Transformers." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/chen2026iclr-critical/)

BibTeX

@inproceedings{chen2026iclr-critical,
  title     = {{Critical Attention Scaling in Long-Context Transformers}},
  author    = {Chen, Shi and Lin, Zhengjiang and Polyanskiy, Yury and Rigollet, Philippe},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/chen2026iclr-critical/}
}