CSR: Achieving 1 Bit Key-Value Cache via Sparse Representation

Abstract

The emergence of long-context applications built on large language models (LLMs) has introduced significant scalability challenges, particularly in memory footprint. The Key-Value (KV) cache, which stores attention keys and values to avoid redundant computation, grows linearly with context length; this growth can substantially increase memory usage and may prevent models from functioning in memory-constrained environments. To address this issue, we propose Cache Sparse Representation (CSR), a novel approach that transforms the dense KV cache tensors into sparse indices and weights, offering a more memory-efficient representation during LLM inference. Furthermore, we introduce NeuralDict, a novel neural-network-based method for automatically generating the dictionary used in our sparse representation. Our extensive experiments demonstrate that CSR matches the performance of state-of-the-art KV cache quantization algorithms while remaining robust in memory-constrained environments.
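The abstract describes replacing dense KV cache tensors with sparse indices and weights over a learned dictionary. The paper's actual CSR/NeuralDict procedure is not given here; the sketch below only illustrates the general idea with greedy matching pursuit against a random unit-norm dictionary. All names (`sparse_encode`, `dictionary`, `n_atoms`) are illustrative assumptions, not the authors' API.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 64, 512  # head dimension and dictionary size (illustrative values)

# Hypothetical fixed dictionary with unit-norm columns ("atoms");
# in the paper this would be produced by NeuralDict instead.
dictionary = rng.standard_normal((d, K))
dictionary /= np.linalg.norm(dictionary, axis=0)

def sparse_encode(x, D, n_atoms=8):
    """Greedy matching pursuit: represent x by n_atoms (index, weight) pairs."""
    residual = x.copy()
    idx, w = [], []
    for _ in range(n_atoms):
        scores = D.T @ residual              # correlation with every atom
        j = int(np.argmax(np.abs(scores)))   # best-matching atom
        coef = float(scores[j])
        idx.append(j)
        w.append(coef)
        residual -= coef * D[:, j]           # remove explained component
    return np.array(idx), np.array(w)

def sparse_decode(idx, w, D):
    """Reconstruct a dense vector from its sparse (index, weight) pairs."""
    return D[:, idx] @ w

# A single cached key/value vector: store 8 indices + 8 weights
# instead of 64 dense floats.
x = rng.standard_normal(d)
idx, w = sparse_encode(x, dictionary, n_atoms=8)
x_hat = sparse_decode(idx, w, dictionary)
err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
```

With small integer indices and low-precision weights, the per-vector storage can drop toward the ~1-bit-per-element regime the title refers to; reconstruction error shrinks as more atoms are kept.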

Cite

Text

Zhang et al. "CSR: Achieving 1 Bit Key-Value Cache via Sparse Representation." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I24.34779

Markdown

[Zhang et al. "CSR: Achieving 1 Bit Key-Value Cache via Sparse Representation." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/zhang2025aaai-csr/) doi:10.1609/AAAI.V39I24.34779

BibTeX

@inproceedings{zhang2025aaai-csr,
  title     = {{CSR: Achieving 1 Bit Key-Value Cache via Sparse Representation}},
  author    = {Zhang, Hongxuan and Zhao, Yao and Zheng, Jiaqi and Zhuang, Chenyi and Gu, Jinjie and Chen, Guihai},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {25860--25867},
  doi       = {10.1609/AAAI.V39I24.34779},
  url       = {https://mlanthology.org/aaai/2025/zhang2025aaai-csr/}
}