RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

Abstract

Transformer-based Large Language Models (LLMs) have become increasingly important. However, scaling LLMs to longer contexts incurs slow inference speed and high GPU memory consumption for caching key-value (KV) vectors. This paper presents RetrievalAttention, a training-free approach that both accelerates the decoding phase and reduces GPU memory consumption by pre-building KV vector indexes for fixed contexts and maintaining them in CPU memory for efficient retrieval. Unlike conventional KV cache methods, RetrievalAttention integrates approximate nearest neighbor search (ANNS) indexes into attention computation. We observe that off-the-shelf ANNS techniques often fail due to the out-of-distribution (OOD) nature of query and key vectors in attention mechanisms. RetrievalAttention overcomes this with an attention-aware vector index. Our evaluation shows that RetrievalAttention achieves near full-attention accuracy while accessing only 1-3% of the data, significantly reducing inference costs. Remarkably, RetrievalAttention enables LLMs with 8B parameters to handle 128K tokens on a single NVIDIA RTX 4090 (24 GB), achieving a decoding speed of 0.107 seconds per token.
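
To make the retrieval-based attention idea concrete, the following minimal Python/NumPy sketch shows the core computation: at each decoding step, only the top-scoring keys are retrieved and attended to. It is an illustration under stated assumptions, not the authors' implementation: a brute-force inner-product search stands in for the paper's attention-aware ANNS index, and the function name and shapes are hypothetical.

import numpy as np

def retrieval_attention(query, keys, values, top_k):
    """Approximate single-head attention over only the top_k retrieved keys.

    query:  (d,)    query vector for the current decoding step
    keys:   (n, d)  cached key vectors for the fixed context
    values: (n, d)  cached value vectors

    Hypothetical sketch: a real system would replace the brute-force
    scoring below with an ANNS index held in CPU memory.
    """
    scores = keys @ query                            # (n,) inner-product scores
    top = np.argpartition(-scores, top_k)[:top_k]    # indices of the top_k keys
    s = scores[top] / np.sqrt(keys.shape[1])         # scaled scores, retrieved set only
    w = np.exp(s - s.max())                          # numerically stable softmax
    w /= w.sum()
    return w @ values[top]                           # (d,) attention output

# Example: 128K-token context, 128-dim head, retrieve ~2% of the keys.
n, d = 128_000, 128
rng = np.random.default_rng(0)
K, V = rng.standard_normal((n, d)), rng.standard_normal((n, d))
q = rng.standard_normal(d)
out = retrieval_attention(q, K, V, top_k=n // 50)

Because the softmax is renormalized over the retrieved set, the output approximates full attention well whenever the retrieved keys capture most of the attention mass, which is what motivates accessing only 1-3% of the KV vectors.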

Cite

Text

Liu et al. "RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval." Advances in Neural Information Processing Systems, 2025.

Markdown

[Liu et al. "RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/liu2025neurips-retrievalattention/)

BibTeX

@inproceedings{liu2025neurips-retrievalattention,
  title     = {{RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval}},
  author    = {Liu, Di and Chen, Meng and Lu, Baotong and Jiang, Huiqiang and Han, Zhenhua and Zhang, Qianxi and Chen, Qi and Zhang, Chengruidong and Ding, Bailu and Zhang, Kai and Chen, Chen and Yang, Fan and Yang, Yuqing and Qiu, Lili},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/liu2025neurips-retrievalattention/}
}