Multi-Branch Self-Drafting for LLM Inference Acceleration
Abstract
The autoregressive decoding paradigm endows large language models (LLMs) with strong language generation capabilities; however, its step-by-step decoding process inherently limits decoding speed. To mitigate this constraint, the prevalent “draft and validation” strategy validates candidate drafts in parallel, allowing LLMs to decode multiple tokens in a single forward pass. However, existing methods for obtaining drafts often incur additional overhead from communication or training, or inherit statistical biases from the drafting corpus. To this end, we propose an innovative draft generation and maintenance approach that leverages the capabilities of the LLM itself. Specifically, we extend the autoregressive decoding paradigm to a multi-branch drafting procedure that efficiently generates draft sequences without any additional models or training, while preserving the quality of the generated content by keeping the LLM parameters unchanged. Experiments across various open-source benchmarks show that our method generates 2.0 to 3.2 tokens per forward step and achieves an approximately twofold improvement in end-to-end throughput over the autoregressive decoding strategy.
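For context, the “draft and validation” loop the abstract refers to works by having the target model score its current context together with a candidate draft in one forward pass, then keep the longest draft prefix that matches its own predictions. The sketch below is a minimal, single-branch illustration of that generic verification step in PyTorch with greedy acceptance; the function name, the Hugging Face-style model interface, and the acceptance rule are illustrative assumptions, not the multi-branch drafting procedure described in the paper.

import torch

def draft_and_verify_step(model, input_ids, draft_ids):
    """One generic draft-and-verify step (illustrative, single branch).

    `model` is assumed to be a Hugging Face-style causal LM returning
    logits of shape (batch, seq_len, vocab_size); `input_ids` is the
    accepted context and `draft_ids` a speculative continuation,
    both of shape (1, length).
    """
    # Score the context and the draft together in one forward pass.
    candidate = torch.cat([input_ids, draft_ids], dim=-1)
    with torch.no_grad():
        logits = model(candidate).logits  # (1, seq_len, vocab_size)

    # Greedy predictions for each draft position, plus one bonus position
    # after the draft in case every draft token is accepted.
    n_draft = draft_ids.shape[-1]
    preds = logits[0, -n_draft - 1:].argmax(dim=-1)  # (n_draft + 1,)

    # Accept draft tokens until they first diverge from the model's own choice.
    accepted = 0
    while accepted < n_draft and preds[accepted].item() == draft_ids[0, accepted].item():
        accepted += 1

    # The token at the first mismatch (or after a fully accepted draft)
    # comes from the model itself, so at least one token is always emitted.
    bonus = preds[accepted:accepted + 1].unsqueeze(0)
    new_ids = torch.cat([draft_ids[:, :accepted], bonus], dim=-1)
    return torch.cat([input_ids, new_ids], dim=-1), accepted + 1

Because the model's own prediction at the first mismatch is always kept, each step commits at least one token, so the step is never slower than plain autoregressive decoding in tokens per forward pass; the paper's contribution lies in how the LLM itself produces and maintains multiple such draft branches.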
Cite
Text
Gao et al. "Multi-Branch Self-Drafting for LLM Inference Acceleration." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I22.34567
BibTeX
@inproceedings{gao2025aaai-multi,
title = {{Multi-Branch Self-Drafting for LLM Inference Acceleration}},
author = {Gao, Zipeng and Xia, Qingrong and Xu, Tong and Duan, Xinyu and Zheng, Zhi and Wang, Zhefeng and Chen, Enhong},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {23942--23950},
doi = {10.1609/AAAI.V39I22.34567},
url = {https://mlanthology.org/aaai/2025/gao2025aaai-multi/}
}