Multi-Branch Self-Drafting for LLM Inference Acceleration
Abstract
The autoregressive decoding paradigm endows large language models (LLMs) with strong language generation capabilities; however, its step-by-step decoding process inherently limits decoding speed. To mitigate this constraint, the prevalent “draft and validation” strategy validates candidate drafts in parallel, allowing LLMs to decode multiple tokens in a single forward pass. However, existing methods for obtaining drafts often incur additional overhead from communication or training, or inherit statistical biases from the drafting corpus. To this end, we propose an innovative draft generation and maintenance approach that leverages the capabilities of the LLM itself. Specifically, we extend the autoregressive decoding paradigm to a multi-branch drafting procedure that efficiently generates draft sequences without any additional models or training, while preserving the quality of the generated content by keeping the LLM parameters unchanged. Experiments across various open-source benchmarks show that our method generates 2.0 to 3.2 tokens per forward step and achieves an approximately twofold improvement in end-to-end throughput over the autoregressive decoding strategy.
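For context, the “draft and validation” loop the abstract refers to works by having the target model score its current context together with a candidate draft in one forward pass, then keep the longest draft prefix that matches its own predictions. The sketch below is a minimal, single-branch illustration of that generic verification step in PyTorch with greedy acceptance; the function name, the Hugging Face-style model interface, and the acceptance rule are illustrative assumptions, not the multi-branch drafting procedure described in the paper.

import torch

def draft_and_verify_step(model, input_ids, draft_ids):
    """One generic draft-and-verify step (illustrative, single branch).

    `model` is assumed to be a Hugging Face-style causal LM returning
    logits of shape (batch, seq_len, vocab_size); `input_ids` is the
    accepted context and `draft_ids` a speculative continuation,
    both of shape (1, length).
    """
    # Score the context and the draft together in one forward pass.
    candidate = torch.cat([input_ids, draft_ids], dim=-1)
    with torch.no_grad():
        logits = model(candidate).logits  # (1, seq_len, vocab_size)

    # Greedy predictions for each draft position, plus one bonus position
    # after the draft in case every draft token is accepted.
    n_draft = draft_ids.shape[-1]
    preds = logits[0, -n_draft - 1:].argmax(dim=-1)  # (n_draft + 1,)

    # Accept draft tokens until they first diverge from the model's own choice.
    accepted = 0
    while accepted < n_draft and preds[accepted].item() == draft_ids[0, accepted].item():
        accepted += 1

    # The token at the first mismatch (or after a fully accepted draft)
    # comes from the model itself, so at least one token is always emitted.
    bonus = preds[accepted:accepted + 1].unsqueeze(0)
    new_ids = torch.cat([draft_ids[:, :accepted], bonus], dim=-1)
    return torch.cat([input_ids, new_ids], dim=-1), accepted + 1

Because the model's own prediction at the first mismatch is always kept, each step commits at least one token, so the step is never slower than plain autoregressive decoding in tokens per forward pass; the paper's contribution lies in how the LLM itself produces and maintains multiple such draft branches.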
Cite
Text
Gao et al. "Multi-Branch Self-Drafting for LLM Inference Acceleration." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I22.34567
BibTeX
@inproceedings{gao2025aaai-multi,
title = {{Multi-Branch Self-Drafting for LLM Inference Acceleration}},
author = {Gao, Zipeng and Xia, Qingrong and Xu, Tong and Duan, Xinyu and Zheng, Zhi and Wang, Zhefeng and Chen, Enhong},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {23942--23950},
doi = {10.1609/AAAI.V39I22.34567},
url = {https://mlanthology.org/aaai/2025/gao2025aaai-multi/}
}