InfAlign: Inference-Aware Language Model Alignment

Abstract

Language model alignment is a critical step in training modern generative language models. Alignment aims to improve the win rate of a sample from the aligned model against the base model. Today, we increasingly use inference-time algorithms (e.g., best-of-$N$, controlled decoding, tree search) to decode from language models rather than standard sampling. We show that this train/test mismatch makes the standard RLHF framework sub-optimal in view of such inference-time methods. To this end, we propose a framework for inference-aware alignment (InfAlign), which aims to optimize the inference-time win rate of the aligned policy against the base model. We prove that for any inference-time decoding procedure, the optimal aligned policy is the solution to the standard RLHF problem with a transformation of the reward. This motivates the calibrate-and-transform RL (InfAlign-CTRL) algorithm for solving this problem, which involves a reward calibration step and a KL-regularized reward maximization step with a transformation of the calibrated reward. For best-of-$N$ sampling and best-of-$N$ jailbreaking, we propose specific transformations offering up to 3-8% improvement on inference-time win rates. Finally, we also show that our proposed reward calibration method is a strong baseline for optimizing standard win rate.
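To make the quantities in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It shows best-of-$N$ decoding, a Monte Carlo estimate of the win rate between two policies, and one plausible reading of reward calibration as an empirical CDF of the reward over base-policy samples. The calibration form is an assumption, since the abstract does not specify it. All function names (`best_of_n`, `win_rate`, `calibrate`) and the Gaussian "policies" are hypothetical and chosen only for illustration.

```python
import random
from bisect import bisect_left, bisect_right


def best_of_n(sample_fn, reward_fn, n, rng):
    """Draw n candidates from a policy and return the highest-reward one."""
    candidates = [sample_fn(rng) for _ in range(n)]
    return max(candidates, key=reward_fn)


def win_rate(sample_a, sample_b, reward_fn, trials, rng):
    """Monte Carlo estimate of P(reward(a) > reward(b)); ties count as half."""
    wins = 0.0
    for _ in range(trials):
        reward_a = reward_fn(sample_a(rng))
        reward_b = reward_fn(sample_b(rng))
        if reward_a > reward_b:
            wins += 1.0
        elif reward_a == reward_b:
            wins += 0.5
    return wins / trials


def calibrate(reward, sorted_base_rewards):
    """Assumed calibration: empirical CDF of `reward` among base-policy rewards.

    Ties count as half, so the result lies in [0, 1].
    """
    below = bisect_left(sorted_base_rewards, reward)
    below_or_equal = bisect_right(sorted_base_rewards, reward)
    return (below + below_or_equal) / (2 * len(sorted_base_rewards))


if __name__ == "__main__":
    rng = random.Random(0)

    # Toy setup (assumption): a "response" is a float and the reward is its value.
    def reward(x):
        return x

    def base(r):
        return r.gauss(0.0, 1.0)

    def aligned(r):
        return r.gauss(0.5, 1.0)

    # Standard win rate: one aligned sample against one base sample.
    print("standard win rate:", win_rate(aligned, base, reward, 5000, rng))

    # Inference-time win rate: best-of-4 from each policy.
    def bon_base(r):
        return best_of_n(base, reward, 4, r)

    def bon_aligned(r):
        return best_of_n(aligned, reward, 4, r)

    print("best-of-4 win rate:", win_rate(bon_aligned, bon_base, reward, 5000, rng))

    # Calibrated reward of a new response relative to base-policy samples.
    base_rewards = sorted(reward(base(rng)) for _ in range(1000))
    print("calibrated reward of 1.0:", calibrate(1.0, base_rewards))
```

In this sketch the standard and inference-time win rates are estimated separately, which mirrors the train/test mismatch the abstract describes: a policy tuned for the first quantity is not necessarily the best one for the second. The transformation applied to the calibrated reward and the KL-regularized RL step are not shown, because the abstract does not give their form.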

Cite

Text

Balashankar et al. "InfAlign: Inference-Aware Language Model Alignment." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Balashankar et al. "InfAlign: Inference-Aware Language Model Alignment." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/balashankar2025icml-infalign/)

BibTeX

@inproceedings{balashankar2025icml-infalign,
  title     = {{InfAlign: Inference-Aware Language Model Alignment}},
  author    = {Balashankar, Ananth and Sun, Ziteng and Berant, Jonathan and Eisenstein, Jacob and Collins, Michael and Hutter, Adrian and Lee, Jong and Nagpal, Chirag and Prost, Flavien and Sinha, Aradhana and Suresh, Ananda Theertha and Beirami, Ahmad},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {2646--2672},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/balashankar2025icml-infalign/}
}