KV Prediction for Improved Time to First Token
Abstract
Inference with transformer models begins with a prompt processing step. This step can be computationally expensive, taking up to tens of seconds for billion-parameter models on edge devices, which introduces significant latency for the end user. To reduce the time a pretrained model spends producing its first output (known as the "time to first token", or TTFT), we introduce a novel method called KV Prediction. In our method, a small auxiliary model processes the prompt and produces an approximation of the KV cache used by a base model. This approximated KV cache is then used with the base model for autoregressive generation, without the need to query the auxiliary model again. Our method produces a Pareto-optimal efficiency-accuracy trade-off when compared to baselines. On TriviaQA, we demonstrate relative accuracy improvements in the range of 15-50% across a range of TTFT FLOPs budgets. We also demonstrate accuracy improvements of up to 30% on HumanEval Python code completion. Additionally, we benchmark models on an Apple M2 Pro CPU and demonstrate that our improvement in FLOPs translates to a TTFT speedup on hardware. We release our code at https://github.com/apple/corenet/tree/main/projects/kv-prediction.
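The abstract describes the mechanism at a high level: a cheap auxiliary model runs the expensive prefill step, and its KV cache is mapped into an approximation of the base model's KV cache, which the base model then uses for decoding. The sketch below illustrates this flow in PyTorch. All names here (`TinyAttentionLM`, `k_proj`, `v_proj`) are hypothetical toys, not the released corenet implementation, and the linear cache projection is only one plausible realization of the predictor; see the paper and repository for the actual method.

```python
# Minimal sketch of the KV Prediction idea, under the assumptions above.
import torch
import torch.nn as nn


class TinyAttentionLM(nn.Module):
    """Toy single-layer causal LM that exposes its KV cache."""

    def __init__(self, vocab: int = 100, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, vocab)

    def prefill(self, tokens: torch.Tensor):
        # Process the whole prompt once and return its KV cache.
        x = self.embed(tokens)
        return self.k(x), self.v(x)

    def decode_step(self, token: torch.Tensor, kv_cache):
        # One autoregressive step that reuses (and extends) the cache.
        x = self.embed(token)
        q, k_new, v_new = self.q(x), self.k(x), self.v(x)
        k = torch.cat([kv_cache[0], k_new], dim=1)
        v = torch.cat([kv_cache[1], v_new], dim=1)
        attn = torch.softmax(q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5, dim=-1)
        logits = self.out(attn @ v)
        return logits[:, -1], (k, v)


aux_dim, base_dim = 16, 32
aux = TinyAttentionLM(dim=aux_dim)    # small, cheap prompt processor
base = TinyAttentionLM(dim=base_dim)  # larger model used for generation

# Hypothetical learned maps from the auxiliary KV space to the base KV
# space (the paper's actual predictor may differ).
k_proj = nn.Linear(aux_dim, base_dim)
v_proj = nn.Linear(aux_dim, base_dim)

prompt = torch.randint(0, 100, (1, 8))

# 1) Only the auxiliary model processes the prompt (cheap prefill).
aux_k, aux_v = aux.prefill(prompt)

# 2) Predict an approximate base-model KV cache from the auxiliary cache.
pred_cache = (k_proj(aux_k), v_proj(aux_v))

# 3) Autoregressive generation uses only the base model; the auxiliary
#    model is never queried again, matching the abstract's description.
next_tok = prompt[:, -1:]  # last prompt token forms the first query
for _ in range(4):
    logits, pred_cache = base.decode_step(next_tok, pred_cache)
    next_tok = logits.argmax(dim=-1, keepdim=True)
```

The efficiency argument follows directly from this structure: prefill FLOPs scale with the auxiliary model's (smaller) size, while per-token decode cost is unchanged because generation runs entirely on the base model.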
Cite
Text
Horton et al. "KV Prediction for Improved Time to First Token." ICLR 2025 Workshops: SCOPE, 2025.
Markdown
[Horton et al. "KV Prediction for Improved Time to First Token." ICLR 2025 Workshops: SCOPE, 2025.](https://mlanthology.org/iclrw/2025/horton2025iclrw-kv/)
BibTeX
@inproceedings{horton2025iclrw-kv,
  title     = {{KV Prediction for Improved Time to First Token}},
  author    = {Horton, Maxwell and Cao, Qingqing and Sun, Chenfan and Jin, Yanzi and Mehta, Sachin and Rastegari, Mohammad and Nabi, Moin},
  booktitle = {ICLR 2025 Workshops: SCOPE},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/horton2025iclrw-kv/}
}