When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models
Abstract
Large Language Models (LLMs) are known for their strong performance, but we uncover a significant structural inefficiency: a phenomenon we term attention collapse. In many pre-trained decoder-style LLMs, the attention matrices in deeper layers degenerate, collapsing to near rank-one structures. These underutilized layers, which we call lazy layers, are redundant and impair model efficiency. To address this, we introduce Inheritune, a simple yet powerful training recipe for building smaller, stronger language models. Inheritune initializes a compact model by inheriting the potent early layers of a larger pre-trained model, then progressively trains and expands it. Our experiments on various models, including the GPT-2 family, demonstrate that models trained with Inheritune can match or even surpass the performance of their larger counterparts despite having significantly fewer layers. This work presents a novel path toward model compression by design, enabling the creation of compact yet highly performant language models.
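To make the notion of a "near rank-one" attention matrix concrete, the following is a minimal sketch of one possible diagnostic. It is not the authors' measurement procedure: the function names (`causal_attention`, `top_singular_mass`), the synthetic score matrices, and the choice of metric (the share of squared singular-value mass carried by the largest singular value) are all assumptions made for illustration. A value close to 1 indicates that a single rank-one component explains almost the entire matrix.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def causal_attention(scores):
    """Row-wise softmax restricted to a causal (lower-triangular) mask."""
    n = scores.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))
    masked = np.where(mask, scores, -np.inf)
    return softmax(masked, axis=-1)


def top_singular_mass(attn):
    """Share of squared singular-value mass in the largest singular value.

    Hypothetical diagnostic: values near 1.0 mean the matrix is
    close to rank one.
    """
    s = np.linalg.svd(attn, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))


rng = np.random.default_rng(0)
n = 16  # toy sequence length

# Synthetic "healthy" head: each query attends to different positions.
healthy = causal_attention(rng.normal(size=(n, n)) * 3.0)

# Synthetic "collapsed" head: every query puts almost all weight on token 0,
# so all rows are nearly identical and the matrix is nearly rank one.
collapsed_scores = np.zeros((n, n))
collapsed_scores[:, 0] = 20.0
collapsed = causal_attention(collapsed_scores)

print("healthy  :", top_singular_mass(healthy))
print("collapsed:", top_singular_mass(collapsed))
```

On real models, the same function could be applied per layer and per head to the attention probabilities returned by a forward pass; the threshold at which a layer counts as "lazy" is left open here, since the abstract does not specify one.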
Cite
Text
Sanyal et al. "When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models." Transactions on Machine Learning Research, 2026.
Markdown
[Sanyal et al. "When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/sanyal2026tmlr-attention/)
BibTeX
@article{sanyal2026tmlr-attention,
title = {{When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models}},
author = {Sanyal, Sunny and Shwartz-Ziv, Ravid and Dimakis, Alex and Sanghavi, Sujay},
journal = {Transactions on Machine Learning Research},
year = {2026},
url = {https://mlanthology.org/tmlr/2026/sanyal2026tmlr-attention/}
}