Draft-Based Approximate Inference for LLMs
Abstract
Optimizing inference for long-context large language models (LLMs) is increasingly important due to the quadratic compute and linear memory cost of Transformers. Existing approximate inference methods, including key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on coarse predictions of token or KV pair importance. We unify and extend recent work by introducing a framework for approximate LLM inference that leverages small draft models to more accurately predict token and KV pair importance. We provide novel theoretical and empirical analyses justifying lookahead-based importance estimation techniques. Within this framework, we present: (i) **SpecKV**, the first method to use lookahead with a small draft model to enable precise KV cache dropping; (ii) **SpecPC**, which leverages draft model attention activations to identify and discard less important prompt tokens; and (iii) **SpecKV-PC**, a cascaded compression strategy combining both techniques. Extensive experiments on long-context benchmarks demonstrate that our methods consistently achieve higher accuracy than existing baselines while retaining the same efficiency gains in memory usage, latency, and throughput.
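As an illustration of the lookahead-based importance estimation the abstract refers to, the following is a minimal sketch, not the authors' implementation. It assumes that query vectors for a few draft-generated lookahead tokens are already available, scores each prompt key by the mean softmax attention it receives from those queries, and keeps the highest-scoring KV pairs under a fixed budget. The function names, the mean-attention scoring rule, and the `budget` parameter are all hypothetical choices made for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def importance_scores(lookahead_queries, prompt_keys):
    """Mean attention weight each prompt key receives from the lookahead queries.

    Assumption: lookahead_queries stand in for queries of draft-generated
    tokens; how they are produced is outside this sketch.
    """
    dim = len(prompt_keys[0])
    scores = [0.0] * len(prompt_keys)
    for q in lookahead_queries:
        # Scaled dot-product attention logits of one query against every key.
        logits = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in prompt_keys
        ]
        for i, w in enumerate(softmax(logits)):
            scores[i] += w / len(lookahead_queries)
    return scores


def select_kv_to_keep(scores, budget):
    """Indices of the `budget` highest-scoring KV pairs, in original order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:budget]
    return sorted(top)


# Toy data (hypothetical): four prompt keys and two lookahead queries.
prompt_keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
lookahead_queries = [[1.0, 0.0], [2.0, 0.0]]

scores = importance_scores(lookahead_queries, prompt_keys)
kept = select_kv_to_keep(scores, budget=2)
```

With this toy data, the keys most aligned with the lookahead queries (indices 0 and 2) are retained and the rest would be dropped from the cache. A real system would work per layer and per attention head on tensors rather than on Python lists.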
Cite
Text
Galim et al. "Draft-Based Approximate Inference for LLMs." International Conference on Learning Representations, 2026.
Markdown
[Galim et al. "Draft-Based Approximate Inference for LLMs." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/galim2026iclr-draftbased/)
BibTeX
@inproceedings{galim2026iclr-draftbased,
title = {{Draft-Based Approximate Inference for LLMs}},
author = {Galim, Kevin and Ewer, Ethan and Kang, Wonjun and Lee, Minjae and Koo, Hyung Il and Lee, Kangwook},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/galim2026iclr-draftbased/}
}