West-of-N: Synthetic Preference Generation for Improved Reward Modeling

Abstract

The success of reinforcement learning from human feedback (RLHF) in language model alignment is strongly dependent on the quality of the underlying reward model. In this paper, we present a novel approach to improving reward model quality by generating synthetic preference data, thereby augmenting the training dataset with on-policy, high-quality preference pairs. Motivated by the promising results of Best-of-N sampling strategies in language model training, we extend their application to reward model training. This results in a self-training strategy to generate preference pairs by selecting the best and worst candidates in a pool of responses to a given query. Empirically, we find that this approach improves the performance of any reward model, with an effect comparable to the addition of a similar quantity of human preference data. This work opens up new avenues of research for improving RLHF for language model alignment, by offering synthetic preference generation as a solution to reward modeling challenges.
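The core procedure described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's released implementation: it assumes a policy sampler and a base reward model scorer are available as callables (`sample_responses` and `score` are hypothetical names), and it forms a synthetic preference pair per query by taking the highest- and lowest-scoring of N sampled candidates.

```python
# Hypothetical sketch of West-of-N synthetic preference generation.
# `sample_responses` (on-policy sampler) and `score` (base reward model)
# are assumed interfaces, not names from the paper or its codebase.
from typing import Callable, List, Tuple


def west_of_n_pair(
    query: str,
    sample_responses: Callable[[str, int], List[str]],  # returns N candidate responses
    score: Callable[[str, str], float],                  # reward for (query, response)
    n: int = 8,
) -> Tuple[str, str]:
    """Return a synthetic (preferred, rejected) pair for `query`.

    Samples N on-policy candidates, scores them with the current reward
    model, and selects the best-scoring response as the positive example
    and the worst-scoring one as the negative.
    """
    candidates = sample_responses(query, n)
    ranked = sorted(candidates, key=lambda r: score(query, r))
    rejected, preferred = ranked[0], ranked[-1]
    return preferred, rejected


def build_synthetic_dataset(queries, sample_responses, score, n=8):
    """Augment reward-model training data with West-of-N preference pairs."""
    return [(q, *west_of_n_pair(q, sample_responses, score, n)) for q in queries]
```

In the self-training loop suggested by the abstract, the reward model trained on human preferences plays the role of `score`, and the resulting synthetic pairs are added to its training set alongside the original human preference data.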

Cite

Text

Pace et al. "West-of-N: Synthetic Preference Generation for Improved Reward Modeling." ICLR 2024 Workshops: DPFM, 2024.

Markdown

[Pace et al. "West-of-N: Synthetic Preference Generation for Improved Reward Modeling." ICLR 2024 Workshops: DPFM, 2024.](https://mlanthology.org/iclrw/2024/pace2024iclrw-westofn/)

BibTeX

@inproceedings{pace2024iclrw-westofn,
  title     = {{West-of-N: Synthetic Preference Generation for Improved Reward Modeling}},
  author    = {Pace, Alizée and Mallinson, Jonathan and Malmi, Eric and Krause, Sebastian and Severyn, Aliaksei},
  booktitle = {ICLR 2024 Workshops: DPFM},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/pace2024iclrw-westofn/}
}