Compute or Load KV Cache? Why Not Both?
Abstract
Large Language Models (LLMs) are increasingly deployed in large-scale online services, enabling sophisticated applications. However, the computational overhead of generating key-value (KV) caches in the prefill stage presents a major bottleneck, particularly for long-context inputs. Prefix caching mitigates this issue by storing KV caches for reuse, reducing redundant computation. Despite its advantages, prefix caching suffers from high latency due to the limited I/O bandwidth of storage devices, constraining inference efficiency. To address this challenge, we introduce Cake, a novel KV cache loading system that optimally utilizes both computational and I/O resources in parallel. Cake employs a bidirectional scheduling strategy that dynamically balances KV cache computation and loading, ensuring efficient resource utilization. Additionally, Cake incorporates an adaptive scheduling mechanism that seamlessly integrates with non-prefix-caching requests, improving system throughput and adapting to fluctuating resource availability. Through extensive evaluations across various hardware configurations, datasets, and storage conditions, Cake achieves an average 2.6$\times$ reduction in Time to First Token (TTFT) compared to compute-only and I/O-only methods. Our findings highlight Cake as an effective and practical solution for optimizing long-context LLM inference, bridging the gap between computation and I/O efficiency in large-scale AI deployments.
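The bidirectional scheduling idea from the abstract can be illustrated with a minimal sketch. The sketch assumes (this detail is not stated in the abstract) that prefill computation must proceed in sequence order while loading KV chunks from storage can proceed in any order, so a compute worker claims chunks from the front of the prefix while an I/O worker claims chunks from the back, and the two meet wherever the compute/I/O balance puts them. All names, chunk counts, and per-chunk costs below are hypothetical illustrations, not the paper's implementation.

```python
# Minimal sketch of a bidirectional compute/load scheduler in the spirit of
# Cake's abstract. The chunk ordering (compute forward from the start, load
# backward from the end) and every identifier here are our assumptions for
# illustration, not the paper's actual API.
import threading
import time

NUM_CHUNKS = 16          # hypothetical: prefix split into fixed-size KV chunks
COMPUTE_MS = 30          # hypothetical per-chunk GPU prefill time
LOAD_MS = 50             # hypothetical per-chunk load time from storage

done = [False] * NUM_CHUNKS
lock = threading.Lock()

def next_compute_chunk():
    """Lowest-index unclaimed chunk (prefill must proceed in order)."""
    for i in range(NUM_CHUNKS):
        if not done[i]:
            return i
    return None

def next_load_chunk():
    """Highest-index unclaimed chunk (loading is order-independent)."""
    for i in reversed(range(NUM_CHUNKS)):
        if not done[i]:
            return i
    return None

def worker(pick_chunk, cost_ms, name):
    while True:
        with lock:
            i = pick_chunk()
            if i is None:
                return
            done[i] = True          # claim the chunk before releasing the lock
        time.sleep(cost_ms / 1000)  # stand-in for GPU prefill or storage I/O
        print(f"{name} finished chunk {i}")

compute = threading.Thread(target=worker, args=(next_compute_chunk, COMPUTE_MS, "compute"))
load = threading.Thread(target=worker, args=(next_load_chunk, LOAD_MS, "load"))
compute.start(); load.start()
compute.join(); load.join()
```

Running the sketch, the faster resource ends up covering more chunks, so the crossover point moves with the relative compute and I/O speeds, which is the balancing behavior the abstract describes at a high level.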
Cite
Jin et al. "Compute or Load KV Cache? Why Not Both?" Proceedings of the 42nd International Conference on Machine Learning, 2025.
BibTeX
@inproceedings{jin2025icml-compute,
title = {{Compute or Load KV Cache? Why Not Both?}},
author = {Jin, Shuowei and Liu, Xueshen and Zhang, Qingzhao and Mao, Zhuoqing},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {28031--28043},
volume = {267},
url = {https://mlanthology.org/icml/2025/jin2025icml-compute/}
}