KV Cache Transform Coding for Compact Storage in LLM Inference

Staniszewski, Konrad; Łańcucki, Adrian

ICLR 2026

Abstract

Serving large language models (LLMs) at scale necessitates efficient key-value (KV) cache management. KV caches can be reused across conversation turns via shared-prefix prompts that are common in iterative code editing and chat. However, stale caches consume scarce GPU memory, require offloading, or force recomputation. We present KVTC, a lightweight transform coder that compresses KV caches for compact on-GPU and off-GPU storage. Drawing on classical media compression, KVTC combines PCA-based feature decorrelation, adaptive quantization, and entropy coding. It requires only a brief initial calibration and leaves model parameters unchanged. By exploiting redundancies in KV caches, KVTC achieves up to 20x compression while maintaining reasoning and long-context accuracy, and 40x or higher for specific use cases. We test KVTC with Llama 3, Mistral NeMo, and R1-Qwen 2.5 models across benchmarks including AIME25, GSM8K, LiveCodeBench, LongBench, MATH-500, MMLU, Qasper and RULER. It consistently outperforms inference-time baselines such as token eviction, quantization, and SVD-based methods, while achieving higher compression ratios. These results support KVTC as a practical building block for memory-efficient LLM serving with reusable KV caches.
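
The three stages named in the abstract (feature decorrelation, quantization, entropy coding) can be illustrated with a small, self-contained sketch. This is not the authors' implementation. It runs on synthetic data, uses a single uniform quantization step as a simplification of adaptive quantization, and uses zlib's DEFLATE as a stand-in entropy coder. The function names `calibrate`, `compress`, and `decompress`, the step size, and the data shapes are all hypothetical choices made for illustration.

```python
import zlib
import numpy as np


def calibrate(calib):
    """Fit a PCA basis on calibration vectors of shape (n, d)."""
    mean = calib.mean(axis=0)
    # Rows of vt are orthonormal principal directions of the centered data.
    _, _, vt = np.linalg.svd(calib - mean, full_matrices=False)
    return mean, vt


def compress(x, mean, vt, step):
    """Decorrelate, quantize, and entropy-code vectors of shape (n, d)."""
    coeffs = (x - mean) @ vt.T  # project onto the PCA basis
    # Uniform scalar quantization (assumption: one shared step size).
    # Low-variance components collapse to zero and cost almost nothing.
    q = np.clip(np.round(coeffs / step), -127, 127).astype(np.int8)
    # Store component-major so each component's values are contiguous.
    payload = zlib.compress(np.ascontiguousarray(q.T).tobytes(), 9)
    return payload, q.shape


def decompress(payload, shape, mean, vt, step):
    """Invert compress(): decode, dequantize, and rotate back."""
    n, d = shape
    q = np.frombuffer(zlib.decompress(payload), dtype=np.int8).reshape(d, n).T
    return (q.astype(np.float32) * step) @ vt + mean


# Synthetic stand-in for cached features: 64 dims driven by 8 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(2048, 8)).astype(np.float32)
mixing = rng.normal(size=(8, 64)).astype(np.float32)
noise = 0.01 * rng.normal(size=(2048, 64)).astype(np.float32)
cache = latent @ mixing + noise

# Calibrate once on a prefix, then compress the remaining vectors.
mean, vt = calibrate(cache[:512])
target = cache[512:]
payload, shape = compress(target, mean, vt, step=0.5)
restored = decompress(payload, shape, mean, vt, step=0.5)

# Compare against a 16-bit baseline, a common storage format for KV caches.
ratio = target.astype(np.float16).nbytes / len(payload)
rel_err = np.linalg.norm(restored - target) / np.linalg.norm(target)
print(f"compression ratio vs fp16: {ratio:.1f}x, relative error: {rel_err:.4f}")
```

The ratio and error printed here describe only this toy data and say nothing about the results reported in the paper. The sketch does show why decorrelation comes first: once the energy is concentrated in a few components, most quantized coefficients are zero and the entropy coder can store them very cheaply.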

Cite

Text

Staniszewski and Łańcucki. "KV Cache Transform Coding for Compact Storage in LLM Inference." International Conference on Learning Representations, 2026.

Markdown

[Staniszewski and Łańcucki. "KV Cache Transform Coding for Compact Storage in LLM Inference." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/staniszewski2026iclr-kv/)

BibTeX

@inproceedings{staniszewski2026iclr-kv,
  title     = {{KV Cache Transform Coding for Compact Storage in LLM Inference}},
  author    = {Staniszewski, Konrad and Łańcucki, Adrian},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/staniszewski2026iclr-kv/}
}