Reinforcement Learning in Inference Time: A Perspective from Successive Policy Iterations
Abstract
Aligning Large Language Models (LLMs) to human preferences is essential for their effective deployment in real-world applications. Traditional post-training methods, such as Reinforcement Learning from Human Feedback (RLHF), are resource-intensive and time-consuming, especially as model sizes continue to grow. Recently, inference-time alignment methods have gained significant attention, as they can steer the LLM output without direct fine-tuning and can be integrated with post-training techniques to further enhance performance. Additionally, these methods enable personalization, allowing models to adapt dynamically to user preferences and specific task requirements. However, these approaches operate in a one-shot manner, limiting policy improvement to a single round. To address this limitation, we introduce inference-time Successive Policy Iterations (SPI), a novel algorithm that enables successive policy improvement at inference time. Specifically, inference-time SPI iteratively learns value functions and leverages them to guide the LLM through a search-based optimization process. Theoretically, our algorithm is equivalent to performing multi-iteration policy optimization on the base model, effectively improving its behavior without direct fine-tuning. Experimental results demonstrate that inference-time SPI significantly improves length-controlled win rates on challenging instruction-following benchmarks, such as AlpacaEval 2.0, achieving a substantial performance boost (e.g., 30.71% → 43.80% for Llama-3-8B-Instruct, compared against GPT-4 responses). Furthermore, inference-time SPI consistently outperforms existing test-time alignment baselines such as Best-of-N (BoN) and weak-to-strong search, demonstrating its effectiveness for inference-time scaling across different tasks.
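To make the iterative idea concrete, below is a minimal sketch of inference-time successive policy improvement. It assumes a base-policy sampler `generate(prompt, guide)` and an outcome reward model `reward(prompt, response)`; the value-learning step is reduced to keeping top-scoring rollouts as guidance, whereas the paper learns value functions and uses search-based optimization, so this is an illustrative approximation rather than the authors' method.

```python
# Hypothetical sketch of inference-time successive policy iteration.
# The interfaces (`generate`, `reward`, `guide`) are assumptions for
# illustration; the paper's actual value learning and search differ.

from typing import Callable, List, Tuple


def spi_inference(
    prompt: str,
    generate: Callable[[str, dict], str],   # base-policy sampler, optionally guided
    reward: Callable[[str, str], float],    # outcome reward model
    num_iterations: int = 3,                # successive policy-improvement rounds
    samples_per_iter: int = 8,              # rollouts used per round
) -> str:
    guide: dict = {}                        # crude stand-in for a learned value function
    best_response, best_score = "", float("-inf")

    for _ in range(num_iterations):
        # 1) Roll out the current (guided) policy.
        rollouts: List[Tuple[str, float]] = []
        for _ in range(samples_per_iter):
            response = generate(prompt, guide)
            rollouts.append((response, reward(prompt, response)))

        # 2) Update the guidance from the rollouts; here we simply keep the
        #    highest-scoring responses to steer the next round of sampling.
        rollouts.sort(key=lambda item: item[1], reverse=True)
        guide = {"exemplars": [r for r, _ in rollouts[: max(1, samples_per_iter // 4)]]}

        # 3) Track the best response found so far across iterations.
        if rollouts[0][1] > best_score:
            best_response, best_score = rollouts[0]

    return best_response
```

In contrast, a Best-of-N baseline would correspond to a single round of this loop with no guidance carried over, which is the one-shot limitation the abstract points to.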
Cite
Text
Zhang et al. "Reinforcement Learning in Inference Time: A Perspective from Successive Policy Iterations." ICLR 2025 Workshops: LLM_Reason_and_Plan, 2025.
Markdown
[Zhang et al. "Reinforcement Learning in Inference Time: A Perspective from Successive Policy Iterations." ICLR 2025 Workshops: LLM_Reason_and_Plan, 2025.](https://mlanthology.org/iclrw/2025/zhang2025iclrw-reinforcement/)
BibTeX
@inproceedings{zhang2025iclrw-reinforcement,
title = {{Reinforcement Learning in Inference Time: A Perspective from Successive Policy Iterations}},
author = {Zhang, Xinnan and Li, Chenliang and Zeng, Siliang and Li, Jiaxiang and Wang, Zhongruo and Lu, Songtao and Garcia, Alfredo and Hong, Mingyi},
booktitle = {ICLR 2025 Workshops: LLM_Reason_and_Plan},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/zhang2025iclrw-reinforcement/}
}