Accelerating Iterative Retrieval-Augmented Language Model Serving with Speculation

Abstract

This paper introduces RaLMSpec, a framework that accelerates iterative retrieval-augmented language model (RaLM) serving with speculative retrieval and batched verification. RaLMSpec further introduces several important systems optimizations, including prefetching, an optimal speculation stride scheduler, and asynchronous verification. The combination of these techniques allows RaLMSpec to significantly outperform existing systems. For document-level iterative RaLM serving, evaluation over three LLMs on four QA datasets shows that RaLMSpec improves over existing approaches by $1.75$-$2.39\times$, $1.04$-$1.39\times$, and $1.31$-$1.77\times$ when the retriever is an exact dense retriever, approximate dense retriever, and sparse retriever, respectively. For token-level iterative RaLM (KNN-LM) serving, RaLMSpec is up to $7.59\times$ and $2.45\times$ faster than existing methods for exact dense and approximate dense retrievers, respectively.
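The core idea of speculative retrieval with batched verification can be illustrated with a minimal sketch. This is not the paper's implementation: the `retrieve` function, the cache-based speculator, and all names below are hypothetical stand-ins, assuming a cheap local cache guesses retrieval results so the model can run ahead, and one batched call to the true retriever later verifies the guesses, rolling back at the first mismatch.

```python
def retrieve(query):
    # Hypothetical expensive exact retriever, mocked as a toy lookup.
    corpus = {"q0": "doc-A", "q1": "doc-B", "q2": "doc-C"}
    return corpus.get(query, "doc-miss")

class SpeculativeRetriever:
    def __init__(self):
        # Local cache acting as the cheap speculative retriever.
        self.cache = {}

    def speculate(self, query):
        # Guess from the cache; return a placeholder document on a miss.
        return self.cache.get(query, "doc-miss")

    def verify(self, queries, guesses):
        # A single batched call to the true retriever checks all
        # speculative steps at once (sequential here for clarity).
        truths = [retrieve(q) for q in queries]
        for i, (guess, truth) in enumerate(zip(guesses, truths)):
            self.cache[queries[i]] = truth  # refresh cache with true result
            if guess != truth:
                return i, truths  # first mismatch: roll back from step i
        return len(queries), truths  # all speculative steps accepted

sr = SpeculativeRetriever()
sr.cache["q0"] = "doc-A"  # warm cache entry: this speculation succeeds
queries = ["q0", "q1"]
guesses = [sr.speculate(q) for q in queries]
accepted, truths = sr.verify(queries, guesses)
print(accepted)  # → 1: the first step is accepted, the second rolls back
```

The speedup comes from replacing many sequential retriever round-trips with one batched verification; correctness is preserved because any misspeculated step is recomputed from the true result.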

Cite

Text

Zhang et al. "Accelerating Iterative Retrieval-Augmented Language Model Serving with Speculation." International Conference on Machine Learning, 2024.

Markdown

[Zhang et al. "Accelerating Iterative Retrieval-Augmented Language Model Serving with Speculation." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/zhang2024icml-accelerating/)

BibTeX

@inproceedings{zhang2024icml-accelerating,
  title     = {{Accelerating Iterative Retrieval-Augmented Language Model Serving with Speculation}},
  author    = {Zhang, Zhihao and Zhu, Alan and Yang, Lijie and Xu, Yihua and Li, Lanting and Phothilimthana, Phitchaya Mangpo and Jia, Zhihao},
  booktitle = {International Conference on Machine Learning},
  year      = {2024},
  pages     = {60626--60643},
  volume    = {235},
  url       = {https://mlanthology.org/icml/2024/zhang2024icml-accelerating/}
}