SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning

Abstract

Recent advances in inference-time compute have significantly improved performance on complex tasks by generating long chains of thought (CoTs) using Large Reasoning Models (LRMs). However, this improved accuracy comes at the cost of high inference latency due to the length of generated reasoning sequences and the autoregressive nature of decoding. Our key insight in tackling these overheads is that LRM inference, and the reasoning that it embeds, is highly tolerant of approximations: complex tasks are typically broken down into simpler steps, each of which brings utility based on the semantic insight it provides for downstream steps rather than the exact tokens it generates. Accordingly, we introduce SpecReason, a system that automatically accelerates LRM inference by using a lightweight model to (speculatively) carry out simpler intermediate reasoning steps and reserving the costly base model only to efficiently assess (and potentially correct) the speculated outputs. Importantly, SpecReason's focus on exploiting the semantic flexibility of thinking tokens in preserving final-answer accuracy is complementary to prior speculation techniques, most notably speculative decoding, which demands token-level equivalence at each step. Across a variety of cross-domain reasoning benchmarks, SpecReason achieves 1.4-3.0× speedup over vanilla LRM inference while improving accuracy by 0.4-9.0%. Compared to speculative decoding without SpecReason, their combination yields an additional 8.8-58.0% latency reduction. We open-source SpecReason at https://anonymous.4open.science/r/specreason/.
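
The control flow described in the abstract (a lightweight model drafts each reasoning step, and the base model assesses the draft and corrects it if needed) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the callables `small_step`, `base_score`, `base_step`, and `is_final`, the score scale, and the `threshold` and `max_steps` values are all hypothetical names and assumptions introduced here.

```python
from typing import Callable, List


def speculative_reasoning(
    question: str,
    small_step: Callable[[str, List[str]], str],        # hypothetical: lightweight model drafts one step
    base_score: Callable[[str, List[str], str], float],  # hypothetical: base model rates a drafted step
    base_step: Callable[[str, List[str]], str],          # hypothetical: base model writes the step itself
    is_final: Callable[[str], bool],                     # hypothetical: detects the end of reasoning
    threshold: float = 0.5,                              # assumed acceptance cutoff, not from the paper
    max_steps: int = 32,                                 # assumed safety bound on the loop
) -> List[str]:
    """Build a chain of reasoning steps, speculating each one with the small model."""
    steps: List[str] = []
    for _ in range(max_steps):
        # 1. Speculate: the lightweight model proposes the next step.
        draft = small_step(question, steps)
        # 2. Assess: the base model judges the draft as a whole step,
        #    rather than checking it token by token.
        if base_score(question, steps, draft) >= threshold:
            step = draft
        else:
            # 3. Correct: on rejection, the base model produces the step instead.
            step = base_step(question, steps)
        steps.append(step)
        if is_final(step):
            break
    return steps
```

Under these assumptions, acceptance depends on whether a drafted step is judged useful for the steps that follow, not on whether its tokens match what the base model would have generated. That is the difference from speculative decoding that the abstract highlights.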

Cite

Text

Pan et al. "SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning." Advances in Neural Information Processing Systems, 2025.

Markdown

[Pan et al. "SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/pan2025neurips-specreason/)

BibTeX

@inproceedings{pan2025neurips-specreason,
  title     = {{SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning}},
  author    = {Pan, Rui and Dai, Yinwei and Zhang, Zhihao and Oliaro, Gabriele and Jia, Zhihao and Netravali, Ravi},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/pan2025neurips-specreason/}
}