Efficient Large Language Model Inference with Neural Block Linearization
Abstract
The high inference demands of transformer-based Large Language Models (LLMs) pose substantial challenges in their deployment. To this end, we introduce *Neural Block Linearization* (NBL), a novel framework for accelerating transformer model inference by replacing self-attention layers with linear approximations derived from Linear Minimum Mean Squared Error estimators. NBL leverages Canonical Correlation Analysis to compute a theoretical upper bound on the approximation error. Then, we use this bound as a criterion for substitution, selecting the LLM layers with the lowest linearization error. NBL can be efficiently applied to pre-trained LLMs without the need for fine-tuning. In experiments, NBL achieves notable computational speed-ups while preserving competitive accuracy on multiple reasoning benchmarks. For instance, applying NBL to 12 self-attention layers in *DeepSeek-R1-Distill-Llama-8B* increases the inference speed by 32% with less than 1% accuracy trade-off, making it a flexible and promising solution to improve the inference efficiency of LLMs.
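Below is a minimal, illustrative sketch of the kind of linear substitution the abstract describes: fitting a Linear Minimum Mean Squared Error (LMMSE) map from a block's input activations to its output activations on a calibration set, and computing canonical correlations (via CCA) that can be used to rank blocks by how well they linearize. The function names (`fit_lmmse`, `cca_correlations`), the calibration-data layout, and the use of the mean canonical correlation as a ranking score are assumptions for illustration; the paper's exact error bound and layer-selection criterion are given in the full text.

```python
import numpy as np

def fit_lmmse(X, Y, eps=1e-6):
    """Fit a linear (LMMSE) map Y ~= X @ W + b from calibration activations.

    X: (n_samples, d_in)  inputs to a transformer block on a calibration set
    Y: (n_samples, d_out) outputs of the same block on the same samples
    """
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mu_x, Y - mu_y
    n = len(X)
    cov_xx = Xc.T @ Xc / n + eps * np.eye(X.shape[1])  # regularized input covariance
    cov_xy = Xc.T @ Yc / n                             # input-output cross-covariance
    W = np.linalg.solve(cov_xx, cov_xy)                # (d_in, d_out) LMMSE weights
    b = mu_y - mu_x @ W                                # bias matching the output mean
    return W, b

def cca_correlations(X, Y, eps=1e-6):
    """Canonical correlations between X and Y (higher => block is more linear)."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = len(X)
    cxx = Xc.T @ Xc / n + eps * np.eye(X.shape[1])
    cyy = Yc.T @ Yc / n + eps * np.eye(Y.shape[1])
    cxy = Xc.T @ Yc / n
    # Whiten both sides with Cholesky factors, then take singular values
    # of the whitened cross-covariance: these are the canonical correlations.
    lx = np.linalg.cholesky(cxx)
    ly = np.linalg.cholesky(cyy)
    M = np.linalg.solve(lx, cxy) @ np.linalg.inv(ly).T
    return np.linalg.svd(M, compute_uv=False)          # values in [0, 1]

# Hypothetical selection loop: score each candidate block on calibration
# activations and replace the ones whose outputs are most linearly
# predictable from their inputs (i.e., lowest expected linearization error).
# calib = {"layer_17": (X17, Y17), "layer_23": (X23, Y23), ...}
# scores = {name: cca_correlations(X, Y).mean() for name, (X, Y) in calib.items()}
# replace = sorted(scores, key=scores.get, reverse=True)[:num_blocks_to_replace]
```

In this sketch, the fitted `(W, b)` pair stands in for the replaced self-attention block at inference time, turning the block into a single matrix multiply plus bias; how the resulting speed-up and accuracy trade-off are measured follows the benchmarks reported in the paper.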
Cite
Text
Erdogan et al. "Efficient Large Language Model Inference with Neural Block Linearization." Advances in Neural Information Processing Systems, 2025.
Markdown
[Erdogan et al. "Efficient Large Language Model Inference with Neural Block Linearization." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/erdogan2025neurips-efficient/)
BibTeX
@inproceedings{erdogan2025neurips-efficient,
title = {{Efficient Large Language Model Inference with Neural Block Linearization}},
author = {Erdogan, Mete and Tonin, Francesco and Cevher, Volkan},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/erdogan2025neurips-efficient/}
}