Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads on Consumer-Grade Devices
Abstract
Scaling the input context length of a large language model (LLM) incurs a significant increase in computation cost and memory footprint for maintaining the attention key-value (KV) cache. Existing KV cache compression methods suffer from inefficient compression strategies and limited memory-reduction effects, making it difficult for LLMs to conduct long-context inference on consumer-grade devices, especially when processing long-context streaming inputs. Such obstacles prevent consumer-grade devices from supporting more complex applications and pose challenges for the democratization of LLMs. To overcome this, we propose Locret, a framework that trains an eviction policy compatible with chunked prefill. By evaluating the causal importance of KV cache units with \textit{retaining heads}, Locret enables precise eviction of cache units, facilitating efficient long-context inference. In our empirical studies, Locret outperforms recent popular and competitive approaches in terms of memory efficiency and generation quality: it achieves up to a $20\times$ KV cache compression ratio with less than $10\%$ performance loss. Furthermore, Locret supports 128K+ long-context inference on a single NVIDIA 4090 GPU without compromising generation quality, requiring less than one GPU hour of additional training.
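The abstract describes score-based eviction interleaved with chunked prefill: a scoring head rates each KV cache unit, and only the highest-scoring units are kept within a fixed budget. The sketch below illustrates that general pattern, not the paper's implementation; the `score_fn` interface, tensor shapes, and all names here are hypothetical stand-ins for a trained retaining head.

```python
import torch

def chunked_prefill_with_eviction(keys, values, score_fn, budget, chunk_size):
    """Illustrative score-based KV cache eviction during chunked prefill.

    keys / values: [seq_len, num_heads, head_dim] tensors for one layer.
    score_fn: hypothetical stand-in for a trained retaining head that maps
              each new cache unit to a scalar importance score.
    budget:   maximum number of cache units retained at any time.
    """
    cached_k, cached_v, cached_s = [], [], []
    for start in range(0, keys.shape[0], chunk_size):
        # Prefill one chunk and score its cache units.
        k = keys[start:start + chunk_size]
        v = values[start:start + chunk_size]
        s = score_fn(k, v)  # shape: [chunk_len]
        cached_k.append(k); cached_v.append(v); cached_s.append(s)

        k_all = torch.cat(cached_k)
        v_all = torch.cat(cached_v)
        s_all = torch.cat(cached_s)
        if s_all.shape[0] > budget:
            # Evict the lowest-scoring units; sort kept indices to
            # preserve the original token order.
            keep = torch.topk(s_all, budget).indices.sort().values
            k_all, v_all, s_all = k_all[keep], v_all[keep], s_all[keep]
        cached_k, cached_v, cached_s = [k_all], [v_all], [s_all]
    return cached_k[0], cached_v[0]

# Toy usage with a random linear scoring head (purely illustrative).
num_heads, head_dim, seq_len = 8, 64, 4096
w = torch.randn(num_heads * head_dim * 2)
score_fn = lambda k, v: torch.cat([k, v], dim=-1).flatten(1) @ w
keys = torch.randn(seq_len, num_heads, head_dim)
values = torch.randn(seq_len, num_heads, head_dim)
k_kept, v_kept = chunked_prefill_with_eviction(keys, values, score_fn,
                                               budget=512, chunk_size=1024)
print(k_kept.shape)  # torch.Size([512, 8, 64])
```

Because eviction runs after every chunk, peak cache memory stays near `budget + chunk_size` units rather than growing with the full sequence length, which is what makes this pattern compatible with streaming long-context inputs.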
Cite
Text
Huang et al. "Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads on Consumer-Grade Devices." Transactions on Machine Learning Research, 2025.
Markdown
[Huang et al. "Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads on Consumer-Grade Devices." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/huang2025tmlr-locret/)
BibTeX
@article{huang2025tmlr-locret,
title = {{Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads on Consumer-Grade Devices}},
author = {Huang, Yuxiang and Yuan, Binhang and Han, Xu and Xiao, Chaojun and Liu, Zhiyuan},
journal = {Transactions on Machine Learning Research},
year = {2025},
url = {https://mlanthology.org/tmlr/2025/huang2025tmlr-locret/}
}