Reward Model Underspecification in Language Model Alignment
Abstract
Reward models play a key role in aligning language model applications towards human preferences. However, this setup can create a dynamic in which the policy model has the incentive to exploit errors in the reward model to achieve high reward. This means that the success of reward-based alignment depends on the ability of reward models to transfer to new distributions created by the aligned policy model. We show that reward models are \emph{underspecified}, in the sense that models that perform similarly in-distribution can yield very different rewards on policy model outputs. These differences propagate to the aligned policies, which we show to be heavily influenced by the random seed used during \emph{pretraining} of the reward model. We show that even a simple alignment strategy --- best-of-$n$ reranking --- creates a semi-adversarial dynamic between the policy and reward models, promoting outputs on which the reward models are more likely to disagree. Finally, we show that a simple ensembling strategy can help to address this issue.
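The best-of-$n$ reranking and ensembling ideas mentioned in the abstract can be illustrated with a short sketch. The code below is not the authors' implementation: `policy_sample` and `reward_models` are hypothetical stand-ins for an aligned policy sampler and a set of independently pretrained reward models, and aggregating scores with `min` over the ensemble is one plausible way to hedge against any single reward model's errors being exploited by the reranking step.

```python
# Minimal sketch (assumed interfaces, not the paper's code):
# best-of-n reranking over an ensemble of reward models.
from typing import Callable, List


def best_of_n_ensemble(
    prompt: str,
    policy_sample: Callable[[str], str],                # draws one response from the policy
    reward_models: List[Callable[[str, str], float]],   # ensemble of reward scorers
    n: int = 16,
) -> str:
    """Sample n candidates and return the one whose worst-case ensemble reward is highest.

    Using min over the ensemble (rather than trusting a single reward model)
    penalizes outputs on which the reward models disagree.
    """
    candidates = [policy_sample(prompt) for _ in range(n)]

    def aggregated_reward(response: str) -> float:
        # Conservative aggregation: the lowest score any ensemble member assigns.
        return min(rm(prompt, response) for rm in reward_models)

    return max(candidates, key=aggregated_reward)
```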
Cite
Eisenstein et al. "Reward Model Underspecification in Language Model Alignment." NeurIPS 2023 Workshops: DistShift, 2023. https://mlanthology.org/neuripsw/2023/eisenstein2023neuripsw-reward/
@inproceedings{eisenstein2023neuripsw-reward,
title = {{Reward Model Underspecification in Language Model Alignment}},
author = {Eisenstein, Jacob and Berant, Jonathan and Nagpal, Chirag and Agarwal, Alekh and Beirami, Ahmad and D'Amour, Alexander Nicholas and Dvijotham, Krishnamurthy Dj and Heller, Katherine A and Pfohl, Stephen Robert and Ramachandran, Deepak},
booktitle = {NeurIPS 2023 Workshops: DistShift},
year = {2023},
url = {https://mlanthology.org/neuripsw/2023/eisenstein2023neuripsw-reward/}
}