Compute or Load KV Cache? Why Not Both?
Abstract
Large Language Models (LLMs) are increasingly deployed in large-scale online services, enabling sophisticated applications. However, the computational overhead of generating key-value (KV) caches in the prefill stage presents a major bottleneck, particularly for long-context inputs. Prefix caching mitigates this issue by storing KV caches for reuse, reducing redundant computation. Despite its advantages, prefix caching suffers from high latency due to the limited I/O bandwidth of storage devices, constraining inference efficiency. To address this challenge, we introduce Cake, a novel KV cache loading system that optimally utilizes both computational and I/O resources in parallel. Cake employs a bidirectional scheduling strategy that dynamically balances KV cache computation and loading, ensuring efficient resource utilization. Additionally, Cake incorporates an adaptive scheduling mechanism that seamlessly integrates with non-prefix-caching requests, improving system throughput and adapting to fluctuating resource availability. Through extensive evaluations across various hardware configurations, datasets, and storage conditions, Cake achieves an average 2.6$\times$ reduction in Time to First Token (TTFT) compared to compute-only and I/O-only methods. Our findings highlight Cake as an effective and practical solution for optimizing long-context LLM inference, bridging the gap between computation and I/O efficiency in large-scale AI deployments.
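The bidirectional scheduling idea from the abstract can be illustrated with a minimal sketch. The sketch assumes (this detail is not stated in the abstract) that prefill computation must proceed in sequence order while loading KV chunks from storage can proceed in any order, so a compute worker claims chunks from the front of the prefix while an I/O worker claims chunks from the back, and the two meet wherever the compute/I/O balance puts them. All names, chunk counts, and per-chunk costs below are hypothetical illustrations, not the paper's implementation.

```python
# Minimal sketch of a bidirectional compute/load scheduler in the spirit of
# Cake's abstract. The chunk ordering (compute forward from the start, load
# backward from the end) and every identifier here are our assumptions for
# illustration, not the paper's actual API.
import threading
import time

NUM_CHUNKS = 16          # hypothetical: prefix split into fixed-size KV chunks
COMPUTE_MS = 30          # hypothetical per-chunk GPU prefill time
LOAD_MS = 50             # hypothetical per-chunk load time from storage

done = [False] * NUM_CHUNKS
lock = threading.Lock()

def next_compute_chunk():
    """Lowest-index unclaimed chunk (prefill must proceed in order)."""
    for i in range(NUM_CHUNKS):
        if not done[i]:
            return i
    return None

def next_load_chunk():
    """Highest-index unclaimed chunk (loading is order-independent)."""
    for i in reversed(range(NUM_CHUNKS)):
        if not done[i]:
            return i
    return None

def worker(pick_chunk, cost_ms, name):
    while True:
        with lock:
            i = pick_chunk()
            if i is None:
                return
            done[i] = True          # claim the chunk before releasing the lock
        time.sleep(cost_ms / 1000)  # stand-in for GPU prefill or storage I/O
        print(f"{name} finished chunk {i}")

compute = threading.Thread(target=worker, args=(next_compute_chunk, COMPUTE_MS, "compute"))
load = threading.Thread(target=worker, args=(next_load_chunk, LOAD_MS, "load"))
compute.start(); load.start()
compute.join(); load.join()
```

Running the sketch, the faster resource ends up covering more chunks, so the crossover point moves with the relative compute and I/O speeds, which is the balancing behavior the abstract describes at a high level.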
Cite
Jin et al. "Compute or Load KV Cache? Why Not Both?" Proceedings of the 42nd International Conference on Machine Learning, 2025.
BibTeX
@inproceedings{jin2025icml-compute,
title = {{Compute or Load KV Cache? Why Not Both?}},
author = {Jin, Shuowei and Liu, Xueshen and Zhang, Qingzhao and Mao, Zhuoqing},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {28031--28043},
volume = {267},
url = {https://mlanthology.org/icml/2025/jin2025icml-compute/}
}