RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding

Abstract

The emergence of long-context large language models (LLMs) offers a promising alternative to traditional retrieval-augmented generation (RAG) for processing extensive documents. However, the computational overhead of long-context inference presents significant efficiency challenges. While Speculative Decoding (SD) traditionally accelerates inference using smaller draft models, its effectiveness diminishes substantially in long-context scenarios due to memory-bound KV cache operations. We introduce Retrieval-Augmented Speculative Decoding (RAPID), which leverages RAG both to accelerate long-context inference and to enhance its generation quality. RAPID introduces the RAG drafter, a draft LLM operating on shortened retrieval contexts, to speculate on the generation of long-context target LLMs. Our approach enables a new paradigm in which same-scale or even larger LLMs can serve as RAG drafters while maintaining computational efficiency. To fully exploit the potentially superior capabilities of stronger RAG drafters, we develop an inference-time knowledge transfer mechanism that enriches the target distribution via RAG. Extensive experiments on the LLaMA-3.1 and Qwen2.5 backbones demonstrate that RAPID effectively integrates the strengths of both RAG and long-context LLMs, achieving significant performance improvements (e.g., from 39.33 to 42.83 on InfiniteBench for LLaMA-3.1-8B) with more than 2$\times$ speedups for long-context inference. Our analyses also show that RAPID remains robust across various context lengths and levels of retrieval quality.
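
The abstract describes the core loop: a RAG drafter that sees only a short retrieved context proposes tokens, and the long-context target LLM verifies them. Below is a minimal, self-contained Python sketch of that loop under the standard speculative-sampling accept/reject rule; it is an illustration under assumptions, not the authors' implementation. The toy hash-based models, vocabulary size, and draft length gamma are stand-ins chosen so the snippet runs as-is, and RAPID's retrieval-augmented knowledge transfer into the target distribution is omitted.

import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16  # toy vocabulary size (assumption for this sketch)

def toy_model(context, temperature):
    # Stand-in for an LLM step: returns a next-token distribution.
    # A real implementation would run a transformer over `context`;
    # here we hash the context into a deterministic softmax.
    seed = hash(tuple(context)) % (2**32)
    logits = np.random.default_rng(seed).normal(size=VOCAB)
    probs = np.exp(logits / temperature)
    return probs / probs.sum()

def drafter_dist(short_ctx, prefix):
    # RAG drafter: conditions only on the short retrieved context.
    return toy_model(tuple(short_ctx) + tuple(prefix), temperature=1.0)

def target_dist(long_ctx, prefix):
    # Long-context target LLM: conditions on the full context.
    return toy_model(tuple(long_ctx) + tuple(prefix), temperature=1.1)

def rapid_step(long_ctx, short_ctx, prefix, gamma=4):
    # One speculative-decoding round: draft gamma tokens, then verify.
    # Standard rule: accept draft token x with prob min(1, p(x) / q(x)).
    drafts, q_probs = [], []
    for _ in range(gamma):
        q = drafter_dist(short_ctx, prefix + drafts)
        x = rng.choice(VOCAB, p=q)
        drafts.append(int(x))
        q_probs.append(q)

    accepted = []
    for i, x in enumerate(drafts):
        p = target_dist(long_ctx, prefix + accepted)
        if rng.random() < min(1.0, p[x] / q_probs[i][x]):
            accepted.append(x)  # target agrees: keep the draft token
        else:
            # Resample from the residual distribution max(0, p - q).
            residual = np.maximum(p - q_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(VOCAB, p=residual)))
            return accepted  # stop at the first rejection
    # All drafts accepted: sample one bonus token from the target.
    p = target_dist(long_ctx, prefix + accepted)
    accepted.append(int(rng.choice(VOCAB, p=p)))
    return accepted

long_ctx = list(range(12))   # stands in for the full long document
short_ctx = [2, 5, 7]        # stands in for retrieved chunks
out = []
while len(out) < 12:
    out.extend(rapid_step(long_ctx, short_ctx, out))
print(out)

In a real system the speedup comes from the target verifying all gamma draft positions in a single forward pass over its long-context KV cache, while the drafter's cache covers only the short retrieved context; the per-position target calls above are a simplification for readability.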

Cite

Text

Chen et al. "RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Chen et al. "RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/chen2025icml-rapid/)

BibTeX

@inproceedings{chen2025icml-rapid,
  title     = {{RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding}},
  author    = {Chen, Guanzheng and Feng, Qilong and Ni, Jinjie and Li, Xin and Shieh, Michael Qizhe},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {8093--8107},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/chen2025icml-rapid/}
}