Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment
Abstract
Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) to human preferences at the time of decoding. However, BoN sampling is susceptible to a problem known as reward hacking: because the reward model is an imperfect proxy for the true objective, over-optimizing its value can degrade performance on the true objective. A common way to prevent reward hacking in preference learning techniques is to optimize the reward with a proximity regularizer (e.g., KL regularization), which keeps the language model close to the reference model. In this research, we propose Regularized Best-of-N (RBoN), a variant of BoN that aims to mitigate reward hacking by incorporating a proximity term into response selection, analogous to preference learning techniques. We evaluate RBoN on the AlpacaFarm and Anthropic's hh-rlhf datasets and show that it outperforms BoN. As an application of RBoN, we use it to generate a pairwise preference learning dataset. Experimental results show that a DPO model trained on a dataset generated with RBoN outperforms a DPO model trained on a dataset generated with vanilla BoN.
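Below is a minimal sketch of proximity-regularized Best-of-N selection, intended only to illustrate the idea described in the abstract. It assumes candidates have already been sampled and scored; the proximity term used here is the candidate's log-probability under the reference policy, an illustrative stand-in rather than the paper's exact regularizers, and all names (rbon_select, ref_logprobs, beta) are hypothetical.

```python
# Illustrative sketch only: not the authors' implementation.
# Setting beta = 0 recovers vanilla Best-of-N selection.

def rbon_select(candidates, rewards, ref_logprobs, beta=0.1):
    """Pick the candidate maximizing proxy reward plus a proximity term.

    candidates   : list of N generated responses for one prompt
    rewards      : proxy reward model scores, one per candidate
    ref_logprobs : log-probability of each candidate under the reference policy,
                   used here as a simple proximity term
    beta         : trade-off between reward and proximity to the reference model
    """
    scores = [r + beta * lp for r, lp in zip(rewards, ref_logprobs)]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]


# Toy usage with made-up numbers (not real model outputs):
cands = ["response A", "response B", "response C"]
print(rbon_select(cands,
                  rewards=[1.2, 1.5, 0.9],
                  ref_logprobs=[-12.0, -30.0, -10.0],
                  beta=0.05))
```

The intuition: a candidate with a high proxy reward but a very low reference-model probability (a likely symptom of reward hacking) is penalized, so selection favors responses that both score well and remain plausible under the reference policy.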
Cite
Text
Jinnai et al. "Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment." ICML 2024 Workshops: MFHAIA, 2024.
Markdown
[Jinnai et al. "Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment." ICML 2024 Workshops: MFHAIA, 2024.](https://mlanthology.org/icmlw/2024/jinnai2024icmlw-regularized/)
BibTeX
@inproceedings{jinnai2024icmlw-regularized,
title = {{Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment}},
author = {Jinnai, Yuu and Morimura, Tetsuro and Ariu, Kaito and Abe, Kenshi},
booktitle = {ICML 2024 Workshops: MFHAIA},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/jinnai2024icmlw-regularized/}
}