Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment

Abstract

Recent work on inference-time alignment has established benefits of increasing inference-time computation in language models, but naively scaling compute through techniques like Best-of-N sampling can cause performance to degrade due to reward hacking. Toward a theoretical understanding of how to best leverage additional computation, we formalize inference-time alignment as improving a pre-trained policy’s responses for a prompt of interest, given access to an imperfect reward model. We analyze the performance of inference-time alignment algorithms in terms of (i) response quality, and (ii) compute, and provide new results that highlight the importance of the pre-trained policy’s coverage over high-quality responses for performance and compute scaling: (1) We show that Best-of-N alignment with an ideal N can achieve optimal performance under stringent notions of coverage, but provably suffers from reward hacking when N is large, and fails to achieve tight guarantees under more realistic coverage conditions; (2) We introduce InferenceTimePessimism, a new algorithm which mitigates reward hacking through deliberate use of inference-time compute, implementing pessimism in the face of uncertainty; we prove that its performance is optimal and scaling-monotonic, i.e., does not degrade as N increases. We complement our theoretical results with experiments that demonstrate the practicality of our algorithm across a variety of tasks and models.
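For intuition, here is a minimal sketch of the two selection rules the abstract contrasts. The `sample`, `score`, and `uncertainty` interfaces are hypothetical stand-ins for a pre-trained policy, an imperfect reward model, and an uncertainty estimate; the pessimistic variant illustrates only the general principle of penalizing uncertain rewards, and is not the paper's InferenceTimePessimism algorithm.

```python
from typing import Callable, List

def best_of_n(sample: Callable[[], str],
              score: Callable[[str], float],
              n: int) -> str:
    """Best-of-N: draw n responses from the policy and return the one the
    (imperfect) reward model scores highest. Larger n raises the best proxy
    score, but can over-optimize the proxy reward (reward hacking)."""
    responses: List[str] = [sample() for _ in range(n)]
    return max(responses, key=score)

def pessimistic_best_of_n(sample: Callable[[], str],
                          score: Callable[[str], float],
                          uncertainty: Callable[[str], float],
                          n: int,
                          beta: float = 1.0) -> str:
    """Illustration of pessimism in the face of uncertainty (NOT the paper's
    InferenceTimePessimism): select against a reward penalized by an
    uncertainty estimate, so responses the reward model is unsure about
    stop winning the argmax as n grows."""
    responses: List[str] = [sample() for _ in range(n)]
    return max(responses, key=lambda r: score(r) - beta * uncertainty(r))
```

Here `beta` (a name introduced for this sketch) trades reward against uncertainty: with `beta = 0` the pessimistic rule reduces to plain Best-of-N, which the paper shows eventually degrades as N grows, whereas a suitably calibrated penalty is what makes a scaling-monotonic guarantee possible.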

Cite

Text

Huang et al. "Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Huang et al. "Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/huang2025icml-bestofn/)

BibTeX

@inproceedings{huang2025icml-bestofn,
  title     = {{Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment}},
  author    = {Huang, Audrey and Block, Adam and Liu, Qinghua and Jiang, Nan and Krishnamurthy, Akshay and Foster, Dylan J},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {25075--25126},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/huang2025icml-bestofn/}
}