Yggdrasil: Bridging Dynamic Speculation and Static Runtime for Latency-Optimal Tree-Based LLM Decoding
Abstract
Speculative decoding improves LLM inference by generating and verifying multiple tokens in parallel, but existing systems suffer from suboptimal performance due to a mismatch between dynamic speculation and static runtime assumptions. We present Yggdrasil, a co-designed system that enables latency-optimal speculative decoding through context-aware tree drafting and compiler-friendly execution. Yggdrasil introduces an equal-growth tree structure for static graph compatibility, a latency-aware optimization objective for draft selection, and stage-based scheduling to reduce overhead. Yggdrasil supports unmodified LLMs and achieves up to $3.98\times$ speedup over state-of-the-art baselines across multiple hardware setups.
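The central systems idea in the abstract is that a speculation tree with a fixed, input-independent shape can be baked into a statically compiled graph: if the number and layout of draft positions never change between decoding steps, the verification kernel's tensor shapes stay constant. As a rough illustration only (not the paper's implementation), the sketch below enumerates an "equal-growth" tree whose fanout depends only on depth; the branching factors and the function name are hypothetical.

```python
# Hypothetical sketch: an "equal-growth" speculation tree expands every node
# at depth d by the same fanout branching[d], so the tree shape (and thus the
# shapes seen by a compiled verification graph) is identical at every step.
from itertools import product

def equal_growth_tree(branching=(4, 2, 2)):
    """Enumerate node paths of a tree whose depth-d fanout is branching[d].

    Each returned tuple is the child-index path from the root to one draft
    position. The total node count is fixed by the branching factors alone:
    4 + 4*2 + 4*2*2 = 28 for the illustrative defaults above.
    """
    nodes = []
    for depth in range(1, len(branching) + 1):
        # All child-index combinations up to this depth are the nodes there.
        nodes.extend(product(*(range(b) for b in branching[:depth])))
    return nodes

if __name__ == "__main__":
    tree = equal_growth_tree()
    print(len(tree))   # 28 draft positions, the same at every decoding step
    print(tree[:5])    # [(0,), (1,), (2,), (3,), (0, 0)]
```

Under this assumption, a context-aware drafter can still choose *which tokens* fill the 28 slots per step; only the tree's shape is frozen, which is what keeps the execution graph static.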
Cite
Text
Guan et al. "Yggdrasil: Bridging Dynamic Speculation and Static Runtime for Latency-Optimal Tree-Based LLM Decoding." Advances in Neural Information Processing Systems, 2025.
Markdown
[Guan et al. "Yggdrasil: Bridging Dynamic Speculation and Static Runtime for Latency-Optimal Tree-Based LLM Decoding." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/guan2025neurips-yggdrasil/)
BibTeX
@inproceedings{guan2025neurips-yggdrasil,
title = {{Yggdrasil: Bridging Dynamic Speculation and Static Runtime for Latency-Optimal Tree-Based LLM Decoding}},
author = {Guan, Yue and Yu, Changming and Fang, Shihan and Hu, Weiming and Pan, Zaifeng and Wang, Zheng and Liu, Zihan and Zhou, Yangjie and Ding, Yufei and Guo, Minyi and Leng, Jingwen},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/guan2025neurips-yggdrasil/}
}