Sequoia: Scalable and Robust Speculative Decoding

Abstract

As the usage of large language models (LLMs) grows, it becomes increasingly important to serve them quickly and efficiently. While speculative decoding has recently emerged as a promising direction for accelerating LLM serving, existing methods cannot scale well to larger speculation budgets and are not robust across decoding hyperparameters. This paper introduces Sequoia, a scalable and robust algorithm for speculative decoding. To improve scalability, Sequoia introduces a dynamic programming algorithm to find an optimal tree structure for the speculated tokens. To achieve robust speculative decoding, Sequoia uses a novel sampling and verification method that outperforms prior work across different decoding temperatures. Sequoia improves the decoding speed of Llama2-7B, Llama2-13B, and Vicuna-33B on an A100 GPU by up to $4.04\times$, $3.73\times$, and $2.27\times$, respectively. To serve Llama3-70B-Instruct on a single L40 GPU through offloading, Sequoia reduces the per-token decoding latency to 0.60 s/token, $9.5\times$ faster than DeepSpeed-Zero-Inference.
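The abstract's central technical idea is a dynamic program that picks the speculation-tree shape maximizing the expected number of accepted tokens per verification step. The sketch below is a minimal illustration of that idea, not the paper's exact formulation: it assumes the $i$-th ranked draft child of any node is accepted independently with a fixed probability `P[i]`, and the names `best_value`, `_attach`, and the values in `P` are all hypothetical.

```python
from functools import lru_cache

# Hypothetical, position-dependent acceptance probabilities: the chance that
# the i-th ranked draft child of a node is accepted by the target model.
P = (0.8, 0.5, 0.3, 0.15, 0.05)


@lru_cache(maxsize=None)
def best_value(n: int) -> float:
    """Max expected number of accepted draft tokens from a speculation
    tree with n draft nodes hanging below an already-verified token."""
    return _attach(n, 0)


@lru_cache(maxsize=None)
def _attach(n: int, j: int) -> float:
    """Best value from distributing n draft nodes among child ranks j, j+1, ..."""
    if n == 0 or j == len(P):
        return 0.0
    best = _attach(n, j + 1)  # option: leave child rank j unused
    for m in range(1, n + 1):  # give m nodes to the subtree rooted at rank j
        # The rank-j child contributes only if it is accepted (prob P[j]);
        # it then yields itself plus the best subtree built from m - 1 nodes.
        sub = P[j] * (1.0 + best_value(m - 1))
        best = max(best, sub + _attach(n - m, j + 1))
    return best


if __name__ == "__main__":
    for budget in (1, 4, 16, 64):
        print(f"budget={budget:3d}  expected accepted tokens={best_value(budget):.3f}")
```

Under this simplified model, the expected yield of a tree decomposes over subtrees, so the optimum for a given node budget can be assembled from optima of smaller budgets; the actual Sequoia algorithm derives its tree from the draft model's measured acceptance statistics rather than a hand-fixed `P`.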

Cite

Text

Chen et al. "Sequoia: Scalable and Robust Speculative Decoding." Neural Information Processing Systems, 2024. doi:10.52202/079017-4116

Markdown

[Chen et al. "Sequoia: Scalable and Robust Speculative Decoding." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/chen2024neurips-sequoia/) doi:10.52202/079017-4116

BibTeX

@inproceedings{chen2024neurips-sequoia,
  title     = {{Sequoia: Scalable and Robust Speculative Decoding}},
  author    = {Chen, Zhuoming and May, Avner and Svirschevski, Ruslan and Huang, Yuhsun and Ryabinin, Max and Jia, Zhihao and Chen, Beidi},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-4116},
  url       = {https://mlanthology.org/neurips/2024/chen2024neurips-sequoia/}
}