Leveraging the True Depth of LLMs

Abstract

The remarkable capabilities of Large Language Models (LLMs) are overshadowed by their immense computational cost. While recent work has shown that many LLM layers can be reordered or even removed with minimal impact on accuracy, these insights have not been translated into significant inference speedups. To bridge this gap, we introduce a novel method that restructures the computational graph by grouping and evaluating consecutive layer pairs in parallel. This approach, requiring no retraining, yields a 1.19x throughput gain on Llama 2 7B while reducing the average benchmark accuracy by only 1.5%. We demonstrate the practical value of this method for large-scale LLM deployment and show that some of the lost accuracy can be recovered with lightweight fine-tuning of the parallelized layers.
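The core idea described in the abstract, evaluating two consecutive layers on the same input and summing their residual contributions so they can run concurrently, can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation: the ParallelPair wrapper name and the assumption that each block returns only its residual-branch output (rather than the full hidden state) are hypothetical, introduced here for illustration.

import torch
import torch.nn as nn

class ParallelPair(nn.Module):
    # Hypothetical wrapper (not the paper's code): groups two consecutive
    # pre-norm decoder blocks and feeds both the same input, replacing
    #   y = x + f_a(x);  z = y + f_b(y)
    # with the parallel approximation
    #   z ~ x + f_a(x) + f_b(x)
    # Assumes each block returns only its residual-branch output.
    def __init__(self, block_a: nn.Module, block_b: nn.Module):
        super().__init__()
        self.block_a = block_a
        self.block_b = block_b

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Both branches depend only on x, so in a real deployment they
        # could be dispatched to separate CUDA streams or devices;
        # here they are evaluated back to back for clarity.
        return x + self.block_a(x) + self.block_b(x)

Because the two branches share the same input and are summed, neither depends on the other's output, which is what allows the throughput gain without retraining; any accuracy drop comes from the approximation of replacing the second block's input y with x.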

Cite

Text

González et al. "Leveraging the True Depth of LLMs." Transactions on Machine Learning Research, 2026.

Markdown

[González et al. "Leveraging the True Depth of LLMs." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/gonzalez2026tmlr-leveraging/)

BibTeX

@article{gonzalez2026tmlr-leveraging,
  title     = {{Leveraging the True Depth of LLMs}},
  author    = {González, Ramón Calvo and Paliotta, Daniele and Pagliardini, Matteo and Jaggi, Martin and Fleuret, François},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/gonzalez2026tmlr-leveraging/}
}