RAEE: A Robust Retrieval-Augmented Early Exit Framework for Efficient Inference

Abstract

Deploying large language models for inference remains challenging due to their high computational overhead. Early exit optimizes model inference by adaptively reducing the number of inference layers. Current methods typically train internal classifiers or use heuristics to determine the exit layer. However, these methods either introduce significant training overhead or degrade performance. To address these limitations, this paper proposes RAEE, a robust Retrieval-Augmented Early Exit framework that not only enables early exit but also enhances model performance through corrective exit information at intermediate layers. This paper first demonstrates that the early exit problem can be effectively modeled as a distribution prediction problem, in which the distribution can be approximated through the exit information of similar data. Subsequently, this paper introduces the process of collecting exit information of correct predictions and the steps to construct the retrieval database. Finally, leveraging the pre-constructed retrieval database, RAEE uses the exit information from retrieved similar data to guide the backbone model's exit. Experimental results demonstrate that RAEE can accelerate inference while achieving robust zero-shot performance across eight downstream tasks.
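
The retrieval step described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names (`build_database`, `predict_exit_layer`), the use of cosine similarity, the similarity-weighted voting rule, and the toy two-dimensional embeddings are all hypothetical. The sketch stores, for each example, the layers at which an intermediate prediction was correct, then approximates an exit-layer distribution for a new query from its nearest neighbours.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_database(embeddings, correct_exit_layers):
    """Pair each embedding with the layers where the prediction was correct.

    Examples with no correct intermediate layer are skipped (an assumption).
    """
    return [(emb, layers)
            for emb, layers in zip(embeddings, correct_exit_layers)
            if layers]


def predict_exit_layer(query, database, k=2):
    """Approximate an exit-layer distribution from the k most similar entries.

    Each neighbour votes for its stored layers, weighted by its similarity
    to the query (hypothetical weighting scheme).
    """
    neighbours = sorted(database,
                        key=lambda entry: cosine(query, entry[0]),
                        reverse=True)[:k]
    scores = defaultdict(float)
    for emb, layers in neighbours:
        weight = max(cosine(query, emb), 0.0)
        for layer in layers:
            scores[layer] += weight
    total = sum(scores.values())
    if total == 0:
        return None, {}  # no usable neighbour: caller runs the full model
    dist = {layer: score / total for layer, score in scores.items()}
    # Pick the most probable layer; ties go to the earliest layer.
    best = min(dist, key=lambda layer: (-dist[layer], layer))
    return best, dist


# Toy data: three stored examples with 2-d embeddings (illustrative only).
db = build_database(
    embeddings=[[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    correct_exit_layers=[[4, 6], [4], [10]],
)
layer, dist = predict_exit_layer([1.0, 0.05], db, k=2)
print(layer, dist)  # layer 4 receives the most weight for this query
```

In this sketch the backbone would stop at the returned layer and fall back to full-depth inference when no layer is returned. A real system would likely replace the linear scan with an approximate nearest-neighbour index.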

Cite

Text

Huang et al. "RAEE: A Robust Retrieval-Augmented Early Exit Framework for Efficient Inference." International Conference on Learning Representations, 2026.

Markdown

[Huang et al. "RAEE: A Robust Retrieval-Augmented Early Exit Framework for Efficient Inference." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/huang2026iclr-raee/)

BibTeX

@inproceedings{huang2026iclr-raee,
  title     = {{RAEE: A Robust Retrieval-Augmented Early Exit Framework for Efficient Inference}},
  author    = {Huang, Lianming and Wu, Shangyu and Cui, Yufei and Xiong, Ying and Hu, Haibo and Liu, Xue and Kuo, Tei-Wei and Guan, Nan and Xue, Chun Jason},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/huang2026iclr-raee/}
}