Reinforcement Learning in Inference Time: A Perspective from Successive Policy Iterations

Abstract

Aligning Large Language Models (LLMs) with human preferences is essential for their effective deployment in real-world applications. Traditional post-training methods, such as Reinforcement Learning from Human Feedback (RLHF), are resource-intensive and time-consuming, especially as model sizes continue to grow. Recently, inference-time alignment methods have gained significant attention: they can steer LLM outputs without direct fine-tuning and can be combined with post-training techniques to further enhance performance. Additionally, these methods enable personalization, allowing models to adapt dynamically to user preferences and specific task requirements. However, existing approaches operate in a one-shot manner, limiting policy improvement to a single round. To address this limitation, we introduce inference-time Successive Policy Iterations (SPI), a novel algorithm that enables successive policy improvement at inference time. Specifically, inference-time SPI iteratively learns value functions and leverages them to guide the LLM through a search-based optimization process. Theoretically, our algorithm is equivalent to performing multi-iteration policy optimization on the base model, effectively improving its behavior without direct fine-tuning. Experimental results demonstrate that inference-time SPI significantly improves length-controlled win rates on challenging instruction-following benchmarks such as AlpacaEval 2.0, achieving a substantial performance boost (e.g., $30.71\% \to 43.80\%$ for \texttt{Llama-3-8B-Instruct} compared against GPT-4 responses). Furthermore, inference-time SPI consistently outperforms existing test-time alignment baselines, such as Best-of-N (BoN) and weak-to-strong search, and is effective for inference-time scaling across different tasks.
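
The abstract describes the method only at a high level: repeatedly estimate a value function over candidate continuations and use it to re-weight and search over the frozen base policy, analogous to successive KL-regularized policy-improvement steps of the form $\pi_{t+1}(y\mid x) \propto \pi_t(y\mid x)\exp(V_t(x,y)/\beta)$. The Python sketch below is a minimal, hypothetical illustration of such a loop, not the authors' implementation; `sample_continuations`, `base_logprobs`, `reward`, and `beta` are stand-in names with placeholder values, used only to show value-guided selection repeated over several iterations without any weight updates.

import random

def base_logprobs(prompt, candidates):
    # Log-probabilities of candidate continuations under the frozen base policy
    # (placeholder values; a real system would query the LLM).
    return {c: -0.1 * len(c) for c in candidates}

def reward(prompt, candidate):
    # Scalar preference score from a reward / value model (placeholder).
    return random.random()

def sample_continuations(prompt, n=8):
    # Draw candidate continuations from the base policy (placeholder).
    return [f"candidate_{i}" for i in range(n)]

def spi_decode(prompt, iterations=3, beta=1.0, n=8):
    # Each iteration refines value estimates and re-scores candidates,
    # mimicking one soft policy-improvement step at inference time.
    values = {}
    best = None
    for _ in range(iterations):
        candidates = sample_continuations(prompt, n)
        logp = base_logprobs(prompt, candidates)
        for c in candidates:
            # Blend the previous value estimate with a fresh reward signal.
            values[c] = 0.5 * values.get(c, 0.0) + 0.5 * reward(prompt, c)
        # Guided selection: base log-probability plus a value bonus.
        scores = {c: logp[c] + values[c] / beta for c in candidates}
        best = max(scores, key=scores.get)
    return best

print(spi_decode("Explain policy iteration in one sentence."))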

Cite

Text

Zhang et al. "Reinforcement Learning in Inference Time: A Perspective from Successive Policy Iterations." ICLR 2025 Workshops: LLM_Reason_and_Plan, 2025.

Markdown

[Zhang et al. "Reinforcement Learning in Inference Time: A Perspective from Successive Policy Iterations." ICLR 2025 Workshops: LLM_Reason_and_Plan, 2025.](https://mlanthology.org/iclrw/2025/zhang2025iclrw-reinforcement/)

BibTeX

@inproceedings{zhang2025iclrw-reinforcement,
  title     = {{Reinforcement Learning in Inference Time: A Perspective from Successive Policy Iterations}},
  author    = {Zhang, Xinnan and Li, Chenliang and Zeng, Siliang and Li, Jiaxiang and Wang, Zhongruo and Lu, Songtao and Garcia, Alfredo and Hong, Mingyi},
  booktitle = {ICLR 2025 Workshops: LLM_Reason_and_Plan},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/zhang2025iclrw-reinforcement/}
}