Rethinking Reward Modeling in Preference-Based Large Language Model Alignment
Abstract
The Bradley-Terry (BT) model is a common and successful practice in reward modeling for Large Language Model (LLM) alignment. However, it remains unclear *why* this model, originally developed for multi-player stochastic game matching, can be adopted to convert pairwise response comparisons to reward values and make predictions, especially given that only a limited number of prompt-response pairs are sparsely compared with others. In this paper, we first establish the convergence rate of BT reward models based on deep neural networks using embeddings, providing a theoretical foundation for their use. Although the BT model is theoretically sound, we argue that it is not a necessary choice from the perspective of downstream optimization: a reward model only needs to preserve correct ranking predictions through a monotonic transformation of the true reward. We highlight the critical concept of *order consistency* in reward modeling and demonstrate that the BT model possesses this property. Moreover, we propose a simple upper-bound algorithm, compatible with off-the-shelf binary classifiers, as an alternative order-consistent reward modeling objective. To offer practical insights, we empirically evaluate the performance of these reward modeling approaches across more than 12,000 experimental setups, using 6 base LLMs, 2 datasets, and diverse annotation designs that vary in the quantity, quality, and pairing choices of preference annotations.
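For reference, the standard BT formulation ties pairwise preferences to pointwise rewards: given a prompt $x$ and two responses $y_1, y_2$ with rewards $r(x, y_1)$ and $r(x, y_2)$, the preference probability is a logistic function of the reward gap (the notation here is generic, not necessarily the paper's):

$$P(y_1 \succ y_2 \mid x) = \frac{e^{r(x, y_1)}}{e^{r(x, y_1)} + e^{r(x, y_2)}} = \sigma\big(r(x, y_1) - r(x, y_2)\big).$$

Any reward model $\hat{r} = g(r)$ with $g$ strictly increasing yields the same ranking predictions, which is the order-consistency property the abstract refers to.

The sketch below illustrates, under assumptions, how an off-the-shelf binary classifier could play the role of an order-consistent reward model: a classifier trained to separate preferred from rejected responses produces a logit by which candidate responses can be ranked. All names and data here are hypothetical; this is a minimal illustration of the general idea, not the paper's exact algorithm.

```python
# Minimal sketch (hypothetical data): an off-the-shelf binary classifier as a reward model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-ins for prompt-response embeddings of preferred vs. rejected responses.
chosen = rng.normal(loc=0.5, size=(200, 16))
rejected = rng.normal(loc=-0.5, size=(200, 16))

X = np.vstack([chosen, rejected])
y = np.concatenate([np.ones(200), np.zeros(200)])  # 1 = preferred, 0 = rejected

clf = LogisticRegression().fit(X, y)

def reward(embedding: np.ndarray) -> float:
    # The logit is a monotonic transform of the predicted preference probability,
    # so ranking responses by it preserves the classifier's learned ordering.
    return float(clf.decision_function(embedding.reshape(1, -1))[0])

# Rank two candidate responses by their (proxy) reward.
e1, e2 = rng.normal(size=16), rng.normal(size=16)
print(reward(e1) > reward(e2))
```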
Cite
Text
Sun et al. "Rethinking Reward Modeling in Preference-Based Large Language Model Alignment." International Conference on Learning Representations, 2025.Markdown
[Sun et al. "Rethinking Reward Modeling in Preference-Based Large Language Model Alignment." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/sun2025iclr-rethinking/)BibTeX
@inproceedings{sun2025iclr-rethinking,
  title     = {{Rethinking Reward Modeling in Preference-Based Large Language Model Alignment}},
  author    = {Sun, Hao and Shen, Yunyi and Ton, Jean-Francois},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/sun2025iclr-rethinking/}
}