RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations

Su, Zunhai; Wei, Hanyu; Chen, Zhe; Shen, Wang; Li, Linge; Yu, Huangqi; Yuan, Kehong

doi:10.24963/IJCAI.2025/690

IJCAI 2025 pp. 6200-6208

Abstract

The key-value (KV) cache enables efficient large language model (LLM) inference by avoiding recomputation of past keys and values. As batch size and context length increase, the oversized KV cache becomes a significant memory bottleneck, highlighting the need for efficient compression. Existing KV quantization methods rely on fine-grained quantization or on retaining a significant portion of the cache at high bit-widths, both of which compromise the compression ratio, and they often fail to remain robust at extremely low average bit-widths. In this work, we explore the potential of rotation techniques for 2-bit KV quantization and propose RotateKV, which achieves accurate and robust performance through the following innovations: (i) Outlier-Aware Rotation, which uses channel reordering to adapt the rotations to varying channel-wise outlier distributions without sacrificing the computational efficiency of the fast Walsh-Hadamard transform (FWHT); (ii) Pre-RoPE Grouped-Head Rotation, which mitigates the impact of rotary position embedding (RoPE) on the proposed outlier-aware rotation and further smooths outliers across heads; (iii) Attention-Sink-Aware Quantization, which leverages massive activations to precisely identify and protect attention sinks. With 2-bit quantization, RotateKV achieves less than 0.3 perplexity (PPL) degradation on WikiText-2 using LLaMA-2-13B and maintains strong chain-of-thought (CoT) reasoning and long-context capabilities, with less than 1.7% degradation on GSM8K, outperforming existing methods even at lower average bit-widths. RotateKV also delivers a 3.97× reduction in peak memory usage, supports 5.75× larger batch sizes, and achieves a 2.32× speedup in the decoding stage.
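The general idea behind rotating before quantizing can be illustrated with a small NumPy sketch. It is not the RotateKV implementation: it applies a plain orthonormal FWHT to synthetic "keys" with two artificial outlier channels and compares 2-bit round-trip error with and without the rotation. The channel reordering, grouped-head rotation, and attention-sink protection described in the abstract are omitted, and the per-row asymmetric quantizer, the function names `fwht` and `fake_quant_2bit`, and the data are all assumptions made for illustration.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis.

    The last dimension must be a power of two. With the 1/sqrt(n) scaling
    the transform is its own inverse.
    """
    x = np.asarray(x, dtype=np.float64).copy()
    n = x.shape[-1]
    if n <= 0 or (n & (n - 1)) != 0:
        raise ValueError("last dimension must be a power of two")
    h = 1
    while h < n:
        # View the data as blocks of size 2*h and apply the butterfly
        # (sum, difference) to the two halves of every block.
        y = x.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a = y[..., 0, :] + y[..., 1, :]
        b = y[..., 0, :] - y[..., 1, :]
        x = np.stack([a, b], axis=-2).reshape(*x.shape[:-1], n)
        h *= 2
    return x / np.sqrt(n)

def fake_quant_2bit(x):
    """Per-row asymmetric uniform quantization to 4 levels, then dequantize.

    This is a generic quantizer chosen for the sketch, not the paper's scheme.
    """
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-12) / 3.0  # 4 levels -> 3 steps
    q = np.clip(np.round((x - lo) / scale), 0, 3)
    return q * scale + lo

# Synthetic data (hypothetical): 256 tokens, 128 channels, two outlier channels.
rng = np.random.default_rng(0)
keys = rng.standard_normal((256, 128))
keys[:, [5, 70]] *= 30.0

# Quantize directly in the original basis.
err_plain = np.mean((fake_quant_2bit(keys) - keys) ** 2)

# Rotate, quantize, then rotate back (the orthonormal FWHT is self-inverse).
rotated = fwht(keys)
err_rot = np.mean((fwht(fake_quant_2bit(rotated)) - keys) ** 2)

print(f"MSE without rotation: {err_plain:.3f}")
print(f"MSE with rotation:    {err_rot:.3f}")
```

In this toy setting the rotation spreads the energy of the two outlier channels across all 128 channels, so each row's quantization range shrinks and the 4 available levels are used more evenly. Real KV caches have more varied outlier patterns, which is the case the abstract's outlier-aware reordering is designed to address.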

Cite

Text

Su et al. "RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/690

Markdown

[Su et al. "RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/su2025ijcai-rotatekv/) doi:10.24963/IJCAI.2025/690

BibTeX

@inproceedings{su2025ijcai-rotatekv,
  title     = {{RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations}},
  author    = {Su, Zunhai and Wei, Hanyu and Chen, Zhe and Shen, Wang and Li, Linge and Yu, Huangqi and Yuan, Kehong},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {6200--6208},
  doi       = {10.24963/IJCAI.2025/690},
  url       = {https://mlanthology.org/ijcai/2025/su2025ijcai-rotatekv/}
}