Guided Speculative Inference for Efficient Test-Time Alignment of LLMs

Abstract

We propose Guided Speculative Inference (GSI), a novel algorithm for efficient reward-guided decoding in large language models. GSI combines soft best-of-$n$ test-time scaling with a reward model $r(x, y)$ and speculative samples from a small auxiliary model $\pi_S(y\mid x)$. We provably approximate both the optimal tilted policy $\pi_{\beta,B}(y\mid x) \propto \pi_B(y\mid x)\exp(\beta\,r(x, y))$ of soft best-of-$n$ under the base model $\pi_B$ and the expected reward under that optimal policy. In experiments on reasoning benchmarks (MATH500, OlympiadBench, Minerva Math, MMLU-STEM, GSM8K) and across different model families, our method achieves higher accuracy than both standard soft best-of-$n$ with $\pi_S$ and reward-guided speculative decoding (Liao et al., 2025), and in certain settings even outperforms soft best-of-$n$ with $\pi_B$, while reducing end-to-end latency by up to 28%.
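
The soft best-of-$n$ selection step that GSI builds on can be sketched as follows. This is only a minimal illustration of sampling one of $n$ already-generated candidates with probability proportional to $\exp(\beta\,r(x, y))$. It is not the GSI algorithm itself, and it omits the speculative samples from $\pi_S$ and any correction toward $\pi_B$. The function name `soft_best_of_n`, its arguments, and the toy rewards are hypothetical and do not come from the paper.

```python
import math
import random


def soft_best_of_n(candidates, rewards, beta, rng=None):
    """Sample one candidate with probability proportional to exp(beta * reward).

    candidates: list of n responses, assumed to be drawn i.i.d. from some model.
    rewards:    list of n scalar rewards r(x, y_i), one per candidate.
    beta:       inverse temperature; 0 gives uniform sampling, large values
                approach hard best-of-n (argmax of the reward).
    Returns the chosen candidate and the selection probabilities.
    """
    if not candidates or len(candidates) != len(rewards):
        raise ValueError("candidates and rewards must be non-empty and equal in length")
    rng = rng or random.Random()

    # Subtract the maximum reward before exponentiating for numerical stability;
    # this leaves the normalized probabilities unchanged.
    max_reward = max(rewards)
    weights = [math.exp(beta * (r - max_reward)) for r in rewards]
    total = sum(weights)
    probs = [w / total for w in weights]

    chosen = rng.choices(candidates, weights=probs, k=1)[0]
    return chosen, probs


if __name__ == "__main__":
    # Toy example with made-up rewards (illustrative only).
    ys = ["answer A", "answer B", "answer C"]
    rs = [0.1, 0.9, 0.5]
    y, p = soft_best_of_n(ys, rs, beta=5.0, rng=random.Random(0))
    print(y, [round(q, 3) for q in p])
```

Setting `beta` to 0 in this sketch ignores the reward entirely, while increasing it concentrates the selection on the highest-reward candidate.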

Cite

Text

Geuter et al. "Guided Speculative Inference for Efficient Test-Time Alignment of LLMs." International Conference on Learning Representations, 2026.

Markdown

[Geuter et al. "Guided Speculative Inference for Efficient Test-Time Alignment of LLMs." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/geuter2026iclr-guided/)

BibTeX

@inproceedings{geuter2026iclr-guided,
  title     = {{Guided Speculative Inference for Efficient Test-Time Alignment of LLMs}},
  author    = {Geuter, Jonathan and Mroueh, Youssef and Alvarez-Melis, David},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/geuter2026iclr-guided/}
}