Sequoia: Scalable and Robust Speculative Decoding

Abstract

As the usage of large language models (LLMs) grows, it becomes increasingly important to serve them quickly and efficiently. While speculative decoding has recently emerged as a promising direction for accelerating LLM serving, existing methods cannot scale well to larger speculation budgets and are not robust across decoding hyperparameters. This paper introduces Sequoia, a scalable and robust algorithm for speculative decoding. To improve scalability, Sequoia introduces a dynamic programming algorithm to find an optimal tree structure for the speculated tokens. To achieve robust speculative decoding, Sequoia uses a novel sampling and verification method that outperforms prior work across different decoding temperatures. Sequoia improves the decoding speed of Llama2-7B, Llama2-13B, and Vicuna-33B on an A100 GPU by up to $4.04\times$, $3.73\times$, and $2.27\times$, respectively. To serve Llama3-70B-Instruct on a single L40 GPU through offloading, Sequoia reduces the per-token decoding latency to 0.60 s/token, $9.5\times$ faster than DeepSpeed-Zero-Inference.
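The abstract's central technical idea is a dynamic program that picks the speculation-tree shape maximizing the expected number of accepted tokens per verification step. The sketch below is a minimal illustration of that idea, not the paper's exact formulation: it assumes the $i$-th ranked draft child of any node is accepted independently with a fixed probability `P[i]`, and the names `best_value`, `_attach`, and the values in `P` are all hypothetical.

```python
from functools import lru_cache

# Hypothetical, position-dependent acceptance probabilities: the chance that
# the i-th ranked draft child of a node is accepted by the target model.
P = (0.8, 0.5, 0.3, 0.15, 0.05)


@lru_cache(maxsize=None)
def best_value(n: int) -> float:
    """Max expected number of accepted draft tokens from a speculation
    tree with n draft nodes hanging below an already-verified token."""
    return _attach(n, 0)


@lru_cache(maxsize=None)
def _attach(n: int, j: int) -> float:
    """Best value from distributing n draft nodes among child ranks j, j+1, ..."""
    if n == 0 or j == len(P):
        return 0.0
    best = _attach(n, j + 1)  # option: leave child rank j unused
    for m in range(1, n + 1):  # give m nodes to the subtree rooted at rank j
        # The rank-j child contributes only if it is accepted (prob P[j]);
        # it then yields itself plus the best subtree built from m - 1 nodes.
        sub = P[j] * (1.0 + best_value(m - 1))
        best = max(best, sub + _attach(n - m, j + 1))
    return best


if __name__ == "__main__":
    for budget in (1, 4, 16, 64):
        print(f"budget={budget:3d}  expected accepted tokens={best_value(budget):.3f}")
```

Under this simplified model, the expected yield of a tree decomposes over subtrees, so the optimum for a given node budget can be assembled from optima of smaller budgets; the actual Sequoia algorithm derives its tree from the draft model's measured acceptance statistics rather than a hand-fixed `P`.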

Cite

Text

Chen et al. "Sequoia: Scalable and Robust Speculative Decoding." Neural Information Processing Systems, 2024. doi:10.52202/079017-4116

Markdown

[Chen et al. "Sequoia: Scalable and Robust Speculative Decoding." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/chen2024neurips-sequoia/) doi:10.52202/079017-4116

BibTeX

@inproceedings{chen2024neurips-sequoia,
  title     = {{Sequoia: Scalable and Robust Speculative Decoding}},
  author    = {Chen, Zhuoming and May, Avner and Svirschevski, Ruslan and Huang, Yuhsun and Ryabinin, Max and Jia, Zhihao and Chen, Beidi},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-4116},
  url       = {https://mlanthology.org/neurips/2024/chen2024neurips-sequoia/}
}