Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping
Abstract
While scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks, it also introduces architectural redundancy, posing efficiency challenges for real-world deployment. Although redundancy in LLMs has received some recognition, how it varies across different Transformer modules, such as MLP and attention layers, remains under-explored. In this work, we investigate redundancy across different Transformer modules, including blocks, MLP layers, and attention layers, through the lens of layer dropping. Surprisingly, despite the pivotal role of attention mechanisms in distinguishing Transformers from other architectures, we find that a large portion of attention layers exhibit excessively high redundancy and can be pruned without degrading performance. For example, LLaMA-3-70B achieves a 43.4% speedup with only a 1.8% drop in performance by pruning half of its attention layers. In contrast, dropping MLP layers severely impairs the model's ability to distinguish between tokens, leading to catastrophic performance degradation. Moreover, our analysis reveals that attention layer redundancy not only persists throughout training but is also evident in randomly initialized models. We attribute this redundancy to three key factors that constrain representational updates from attention layers: sparse attention patterns, over-smoothed token embeddings, and the low representational magnitude of attention outputs. Overall, our findings offer valuable insights into the internal redundancy of Transformer architectures and provide practical guidance for designing more efficient LLMs. The code is released at: https://github.com/CASE-Lab-UMD/LLM-Drop.
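As a rough illustration of what dropping an attention or MLP layer means in a residual Transformer block, the following is a minimal NumPy sketch, not the released implementation. The toy block, the names `ToyBlock`, `attn_similarity`, and `drop_most_redundant_attention`, and in particular the redundancy score (cosine similarity between a sublayer's input and its output) are assumptions made for illustration; the abstract does not specify the selection metric, so consult the repository above for the actual method.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class ToyBlock:
    """Hypothetical pre-norm block: x + attn(norm(x)), then x + mlp(norm(x))."""

    def __init__(self, dim, rng):
        init = lambda shape: rng.normal(0.0, 0.02, shape)
        self.wq, self.wk, self.wv, self.wo = (init((dim, dim)) for _ in range(4))
        self.w1, self.w2 = init((dim, 4 * dim)), init((4 * dim, dim))
        self.drop_attn = False  # when True, the attention sublayer is skipped
        self.drop_mlp = False   # when True, the MLP sublayer is skipped

    def attn(self, h):
        # Single-head causal self-attention over a (tokens, dim) array.
        q, k, v = h @ self.wq, h @ self.wk, h @ self.wv
        scores = q @ k.T / np.sqrt(h.shape[-1])
        future = np.triu(np.ones_like(scores), k=1).astype(bool)
        scores = np.where(future, -np.inf, scores)
        return softmax(scores) @ v @ self.wo

    def mlp(self, h):
        return np.maximum(h @ self.w1, 0.0) @ self.w2

    def forward(self, x):
        # Dropping a sublayer leaves only the residual (identity) path.
        if not self.drop_attn:
            x = x + self.attn(layer_norm(x))
        if not self.drop_mlp:
            x = x + self.mlp(layer_norm(x))
        return x


def attn_similarity(block, x):
    # Assumed score: mean cosine similarity between the attention
    # sublayer's input and its output (higher = smaller update).
    out = x + block.attn(layer_norm(x))
    num = (x * out).sum(axis=-1)
    den = np.linalg.norm(x, axis=-1) * np.linalg.norm(out, axis=-1)
    return float((num / den).mean())


def drop_most_redundant_attention(blocks, x, n_drop):
    # Score every attention sublayer on the undropped model, then
    # mark the n_drop highest-scoring ones as dropped.
    scores, h = [], x
    for block in blocks:
        scores.append(attn_similarity(block, h))
        h = block.forward(h)
    for i in np.argsort(scores)[::-1][:n_drop]:
        blocks[i].drop_attn = True
    return scores


rng = np.random.default_rng(0)
blocks = [ToyBlock(16, rng) for _ in range(4)]
x = rng.normal(size=(5, 16))  # 5 tokens, 16-dimensional embeddings
scores = drop_most_redundant_attention(blocks, x, n_drop=2)
```

In this sketch a dropped sublayer costs nothing at inference time because its computation is skipped entirely, which is the source of the speedup; whether accuracy survives depends on how small that sublayer's update to the residual stream was, which a toy model with random weights cannot tell you.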
Cite
Text
He et al. "Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping." Transactions on Machine Learning Research, 2026.
Markdown
[He et al. "Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/he2026tmlr-uncovering/)
BibTeX
@article{he2026tmlr-uncovering,
title = {{Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping}},
author = {He, Shwai and Sun, Guoheng and Shen, Zheyu and Li, Ang},
journal = {Transactions on Machine Learning Research},
year = {2026},
url = {https://mlanthology.org/tmlr/2026/he2026tmlr-uncovering/}
}