Efficient Low Rank Attention for Long-Context Inference in Large Language Models
Abstract
As the length of input text increases, the key-value (KV) cache in LLMs imposes prohibitive GPU memory costs and limits long-context inference on resource-constrained devices. Existing approaches, such as KV quantization and pruning, reduce memory usage but suffer from numerical precision loss or suboptimal retention of key-value pairs. In this work, we introduce Low Rank Query and Key attention (LRQK), a two-stage framework that jointly decomposes the full-precision query and key matrices into compact rank-\(r\) factors during the prefill stage and then uses these low-dimensional projections to compute proxy attention scores in \(\mathcal{O}(lr)\) time at each decode step. LRQK selects only the top-\(k\) tokens together with a small fixed set of recent tokens and retrieves their full-precision KV pairs through a mixed GPU-CPU cache with a hit-and-miss mechanism, in which only the missing pairs are transferred; this preserves exact attention outputs while reducing CPU-GPU data movement. Extensive experiments on the RULER and LongBench benchmarks with LLaMA-3-8B and Qwen2.5-7B show that LRQK matches or surpasses leading sparse-attention methods in long-context settings while delivering significant memory savings with minimal accuracy loss. Our code is available at \url{https://github.com/tenghuilee/LRQK}.
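The pipeline described above (rank-\(r\) factorization during prefill, \(\mathcal{O}(lr)\) proxy scoring at each decode step, and exact attention over the selected top-\(k\) plus recent tokens) can be summarized with a short sketch. The single-head PyTorch sketch below is illustrative only: the truncated-SVD factorization, the helper names lowrank_factor, select_tokens, and sparse_attention, and the plain indexing that stands in for the GPU-CPU hit-and-miss cache are assumptions, not the released implementation (see the linked repository for the authors' code).

import torch

def lowrank_factor(K, r):
    # Prefill: factor the full-precision key matrix K (l x d) into rank-r
    # pieces so later proxy scores cost O(l * r). A truncated SVD is used
    # here purely for illustration; the paper's joint query/key
    # factorization may differ.
    U, S, Vh = torch.linalg.svd(K, full_matrices=False)
    K_low = U[:, :r] * S[:r]   # (l, r) compressed keys
    P = Vh[:r, :]              # (r, d) projection applied to queries
    return K_low, P

def select_tokens(q, K_low, P, top_k, recent):
    # Decode step: project the query to rank r, score every cached token
    # with the low-rank proxy, and keep the top-k tokens plus a fixed
    # window of the most recent ones.
    l = K_low.shape[0]
    q_low = q @ P.T                           # (r,)
    proxy = K_low @ q_low                     # (l,) proxy scores, O(l * r)
    top_idx = torch.topk(proxy, k=min(top_k, l)).indices
    recent_idx = torch.arange(max(0, l - recent), l, device=proxy.device)
    return torch.unique(torch.cat([top_idx, recent_idx]))

def sparse_attention(q, idx, K_full, V_full):
    # Exact attention restricted to the selected tokens. In LRQK the
    # full-precision pairs K_full[idx], V_full[idx] would be served by the
    # mixed GPU-CPU cache, transferring only the missing ("miss") entries.
    K_sel, V_sel = K_full[idx], V_full[idx]
    scores = q @ K_sel.T / K_sel.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ V_sel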
Cite
Text
Li et al. "Efficient Low Rank Attention for Long-Context Inference in Large Language Models." Advances in Neural Information Processing Systems, 2025.
Markdown
[Li et al. "Efficient Low Rank Attention for Long-Context Inference in Large Language Models." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/tenghui2025neurips-efficient/)
BibTeX
@inproceedings{tenghui2025neurips-efficient,
title = {{Efficient Low Rank Attention for Long-Context Inference in Large Language Models}},
author = {Li, Tenghui and Zhou, Guoxu and Zhao, Xuyang and Qiu, Yuning and Zhao, Qibin},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/tenghui2025neurips-efficient/}
}