When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models

Abstract

Large Language Models (LLMs) are known for their strong performance, but we uncover a significant structural inefficiency, a phenomenon we term attention collapse. In many pre-trained decoder-style LLMs, the attention matrices in deeper layers degenerate, collapsing to near rank-one structures. These underutilized layers, which we call lazy layers, are redundant and impair model efficiency. To address this, we introduce Inheritune, a simple yet powerful training recipe designed to build smaller, stronger language models. Inheritune initializes a compact model by inheriting the potent early layers from a larger pre-trained model and then progressively trains and expands it. Our experiments on various models, including the GPT-2 family, demonstrate that models trained with Inheritune can match or even surpass the performance of their larger counterparts despite having significantly fewer layers. This work presents a novel path toward model compression by design, enabling the creation of compact yet highly performant language models.
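To make the notion of a near rank-one attention matrix concrete, the following is a minimal NumPy sketch, not the authors' measurement code. It assumes one simple proxy for collapse: the fraction of squared singular-value mass carried by the largest singular value of a causal attention matrix. The function names `causal_softmax_attention` and `top_singular_mass`, the sequence length, and the synthetic score matrices are all hypothetical choices made for illustration.

```python
import numpy as np

def causal_softmax_attention(scores):
    """Row-wise softmax over causally masked (lower-triangular) scores."""
    n = scores.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))
    masked = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability before exponentiating.
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)

def top_singular_mass(attn):
    """Share of squared singular-value mass in the top singular value.

    A value close to 1.0 means the matrix is close to rank one.
    This proxy is an assumption of this sketch, not a metric taken from the paper.
    """
    s = np.linalg.svd(attn, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))

rng = np.random.default_rng(0)
n = 16  # hypothetical sequence length

# A "healthy" head: each query spreads attention over varied positions.
healthy = causal_softmax_attention(rng.normal(scale=3.0, size=(n, n)))

# A "collapsed" head: every query attends almost entirely to the first token,
# so all rows are nearly identical and the matrix is nearly rank one.
collapsed_scores = np.zeros((n, n))
collapsed_scores[:, 0] = 20.0
collapsed = causal_softmax_attention(collapsed_scores)

healthy_mass = top_singular_mass(healthy)
collapsed_mass = top_singular_mass(collapsed)
print(f"healthy head:   {healthy_mass:.3f}")
print(f"collapsed head: {collapsed_mass:.3f}")
```

In this toy setting the collapsed head scores close to 1.0 while the healthy head scores noticeably lower. Applying the same function to attention maps extracted from a real model, layer by layer, would be one way to look for the lazy layers described above.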

Cite

Text

Sanyal et al. "When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models." Transactions on Machine Learning Research, 2026.

Markdown

[Sanyal et al. "When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/sanyal2026tmlr-attention/)

BibTeX

@article{sanyal2026tmlr-attention,
  title     = {{When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models}},
  author    = {Sanyal, Sunny and Shwartz-Ziv, Ravid and Dimakis, Alex and Sanghavi, Sujay},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/sanyal2026tmlr-attention/}
}