Accelerating LLM Inference with Staged Speculative Decoding
Abstract
Recent advances with large language models (LLMs) illustrate their diverse capabilities. We propose a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low arithmetic intensity of small-batch inference by improving upon previous work in speculative decoding. First, we restructure the speculative batch as a tree, which reduces generation costs and increases the expected tokens per batch. Second, we add a second stage of speculative decoding. Taken together, we reduce single-batch decoding latency by 3.16x with a 762M parameter GPT-2-L model while perfectly preserving output quality.
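To make the two-stage idea concrete, below is a minimal sketch of staged speculative decoding with greedy verification. It is not the paper's implementation: the tree-structured speculative batch is omitted for brevity, toy next-token functions stand in for the oracle (target), draft, and second-stage draft models, and all function names and the window sizes k1/k2 are illustrative assumptions. In real use, each "model" would be an LLM forward pass that scores an entire proposed batch in parallel.

# Hedged sketch: two-stage (staged) speculative decoding with greedy verification.
# All names and parameters here are illustrative, not the paper's code.
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token id

def speculate(draft: NextToken, prefix: List[int], k: int) -> List[int]:
    """Let the smaller model propose k tokens autoregressively."""
    out = list(prefix)
    for _ in range(k):
        out.append(draft(out))
    return out[len(prefix):]

def verify(target: NextToken, prefix: List[int], proposal: List[int]) -> List[int]:
    """Greedy verification: accept the longest prefix of the proposal that the
    target model would also have generated, then append one corrected token."""
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposal:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)  # target's correction ends this round
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target(ctx))       # bonus token when everything matched
    return accepted

def staged_speculative_decode(oracle: NextToken, draft: NextToken,
                              draft2: NextToken, prefix: List[int],
                              n_tokens: int, k1: int = 4, k2: int = 2) -> List[int]:
    """Two stages: the smallest model (draft2) speculates for the draft model,
    and the draft model's verified proposal is then verified by the oracle."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        inner = speculate(draft2, out, k2)              # stage 2: draft for the draft
        draft_proposal = verify(draft, out, inner)[:k1]  # draft model checks/extends it
        out += verify(oracle, out, draft_proposal)       # stage 1: oracle verifies
    return out[len(prefix):len(prefix) + n_tokens]

if __name__ == "__main__":
    # Toy "models": the draft agrees with the oracle, the second-stage draft
    # occasionally disagrees, so both verification stages are exercised.
    oracle = lambda ctx: (ctx[-1] + 1) % 50257
    draft  = lambda ctx: (ctx[-1] + 1) % 50257
    draft2 = lambda ctx: (ctx[-1] + (2 if len(ctx) % 7 == 0 else 1)) % 50257
    print(staged_speculative_decode(oracle, draft, draft2, [0], 16))

Because verification accepts several proposed tokens per oracle call, the oracle runs far fewer sequential forward passes than plain autoregressive decoding, which is where the latency reduction comes from.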
Cite
Text
Spector and Re. "Accelerating LLM Inference with Staged Speculative Decoding." ICML 2023 Workshops: ES-FoMO, 2023.
Markdown
[Spector and Re. "Accelerating LLM Inference with Staged Speculative Decoding." ICML 2023 Workshops: ES-FoMO, 2023.](https://mlanthology.org/icmlw/2023/spector2023icmlw-accelerating/)
BibTeX
@inproceedings{spector2023icmlw-accelerating,
  title     = {{Accelerating LLM Inference with Staged Speculative Decoding}},
  author    = {Spector, Benjamin Frederick and Re, Christopher},
  booktitle = {ICML 2023 Workshops: ES-FoMO},
  year      = {2023},
  url       = {https://mlanthology.org/icmlw/2023/spector2023icmlw-accelerating/}
}