Scaling up, Speeding up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling

Sun, Shengyin; Li, Yiming; Li, Xing; Lian, Yingzhao; Lin, Weizhe; Zhen, Hui-Ling; Yang, Zhiyuan; Yu, Xianzhi; Chen, Chen; Yuan, Mingxuan; Ma, Chen

Scaling up, Speeding up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling

Shengyin Sun, Yiming Li, Xing Li, Yingzhao Lian, Weizhe Lin, Hui-Ling Zhen, Zhiyuan Yang, Xianzhi Yu, Chen Chen, Mingxuan Yuan, Chen Ma

ICLR 2026

/iclr/2026/sun2026iclr-scaling/

Abstract

Test-time scaling has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs) by allocating additional computational resources during inference. However, this paradigm is inherently inefficient due to the generation of different reasoning traces, leading to significant computational overhead. Speculative decoding offers a promising avenue for mitigating this inefficiency, yet its efficacy in the structured and repetition-rich context remains unexplored. To bridge this gap, we introduce the first comprehensive benchmark designed to evaluate speculative decoding methods in LLM test-time scaling. Our benchmark provides consistent experimental protocols across representative test-time scaling paradigms (e.g., Best-of-N sampling and multi-round thinking), enabling a fair comparison of three major categories of speculative decoding: model-based, training-based, and N-gram-based methods. Extensive experiments reveal that simple N-gram-based methods effectively capture repetitive patterns, demonstrating unique potential in accelerating test-time scaling. This phenomenon demonstrates the value of integrating N-gram-based methods with model-based or training-based approaches to benefit both repetitive and diverse reasoning in test-time scaling. We hope this benchmark spurs further research on speculative decoding for test-time scaling, enabling faster and more practical reasoning in LLMs through better handling of repetitive and diverse reasoning paths. Code available at <https://github.com/sunshy-1/SpecTTS-Bench>.

PDF ICLR OpenReview Semantic Scholar

Cite

Text

Sun et al. "Scaling up, Speeding up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling." International Conference on Learning Representations, 2026.

Markdown

[Sun et al. "Scaling up, Speeding up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/sun2026iclr-scaling/)

BibTeX

@inproceedings{sun2026iclr-scaling,
  title     = {{Scaling up, Speeding up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling}},
  author    = {Sun, Shengyin and Li, Yiming and Li, Xing and Lian, Yingzhao and Lin, Weizhe and Zhen, Hui-Ling and Yang, Zhiyuan and Yu, Xianzhi and Chen, Chen and Yuan, Mingxuan and Ma, Chen},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/sun2026iclr-scaling/}
}