Rethinking Reward Modeling in Preference-Based Large Language Model Alignment
Abstract
The Bradley-Terry (BT) model is a common and successful practice in reward modeling for Large Language Model (LLM) alignment. However, it remains unclear *why* this model, originally developed for multi-player stochastic game matching, can be adopted to convert pairwise response comparisons to reward values and make predictions, especially given that only a limited number of prompt-response pairs are sparsely compared with others. In this paper, we first establish the convergence rate of BT reward models based on deep neural networks using embeddings, providing a theoretical foundation for their use. Although the BT model is theoretically sound, we argue that it is not a necessary choice from the perspective of downstream optimization: a reward model only needs to preserve correct ranking predictions through a monotonic transformation of the true reward. We highlight the critical concept of *order consistency* in reward modeling and demonstrate that the BT model possesses this property. Moreover, we propose a simple upper-bound algorithm, compatible with off-the-shelf binary classifiers, as an alternative order-consistent reward modeling objective. To offer practical insights, we empirically evaluate the performance of these reward modeling approaches across more than 12,000 experimental setups, using 6 base LLMs, 2 datasets, and diverse annotation designs that vary in the quantity, quality, and pairing choices of preference annotations.
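For reference, the standard BT formulation ties pairwise preferences to pointwise rewards: given a prompt $x$ and two responses $y_1, y_2$ with rewards $r(x, y_1)$ and $r(x, y_2)$, the preference probability is a logistic function of the reward gap (the notation here is generic, not necessarily the paper's):

$$P(y_1 \succ y_2 \mid x) = \frac{e^{r(x, y_1)}}{e^{r(x, y_1)} + e^{r(x, y_2)}} = \sigma\big(r(x, y_1) - r(x, y_2)\big).$$

Any reward model $\hat{r} = g(r)$ with $g$ strictly increasing yields the same ranking predictions, which is the order-consistency property the abstract refers to.

The sketch below illustrates, under assumptions, how an off-the-shelf binary classifier could play the role of an order-consistent reward model: a classifier trained to separate preferred from rejected responses produces a logit by which candidate responses can be ranked. All names and data here are hypothetical; this is a minimal illustration of the general idea, not the paper's exact algorithm.

```python
# Minimal sketch (hypothetical data): an off-the-shelf binary classifier as a reward model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-ins for prompt-response embeddings of preferred vs. rejected responses.
chosen = rng.normal(loc=0.5, size=(200, 16))
rejected = rng.normal(loc=-0.5, size=(200, 16))

X = np.vstack([chosen, rejected])
y = np.concatenate([np.ones(200), np.zeros(200)])  # 1 = preferred, 0 = rejected

clf = LogisticRegression().fit(X, y)

def reward(embedding: np.ndarray) -> float:
    # The logit is a monotonic transform of the predicted preference probability,
    # so ranking responses by it preserves the classifier's learned ordering.
    return float(clf.decision_function(embedding.reshape(1, -1))[0])

# Rank two candidate responses by their (proxy) reward.
e1, e2 = rng.normal(size=16), rng.normal(size=16)
print(reward(e1) > reward(e2))
```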
Cite
Text
Sun et al. "Rethinking Reward Modeling in Preference-Based Large Language Model Alignment." International Conference on Learning Representations, 2025.Markdown
[Sun et al. "Rethinking Reward Modeling in Preference-Based Large Language Model Alignment." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/sun2025iclr-rethinking/)BibTeX
@inproceedings{sun2025iclr-rethinking,
  title     = {{Rethinking Reward Modeling in Preference-Based Large Language Model Alignment}},
  author    = {Sun, Hao and Shen, Yunyi and Ton, Jean-Francois},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/sun2025iclr-rethinking/}
}