Bayesian Reward Models for LLM Alignment

Abstract

To ensure that large language model (LLM) responses are helpful and non-toxic, a reward model trained on human preference data is typically used to score candidate responses. High-reward responses are then selected through best-of-$n$ (BoN) sampling, or the LLM is further optimized to produce high-reward responses through reinforcement learning from human feedback (RLHF). However, these processes are susceptible to reward overoptimization or "hacking", where responses receive high rewards due to imperfections in the reward model rather than genuine preference, particularly as prompts or responses deviate from the training data. To address these challenges, we propose training a Bayesian reward model, which signals higher uncertainty further from the training data distribution. We trained Bayesian reward models using a Laplace approximation over LoRA weights, and found that the resulting uncertainty estimates effectively mitigate reward overoptimization in BoN sampling.
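
The core idea described in the abstract can be sketched as uncertainty-penalized best-of-$n$ selection: score each candidate with the reward model's predictive mean minus a penalty proportional to its predictive uncertainty. The sketch below is illustrative only; the `reward_mean_and_std` interface and the `penalty` coefficient are assumptions, not the authors' implementation.

```python
# Minimal sketch of uncertainty-penalized best-of-n (BoN) selection.
# `reward_mean_and_std` stands in for a Bayesian reward model (e.g. one
# whose LoRA weights have a Laplace-approximated posterior) returning a
# predictive mean and standard deviation; it is an assumed interface.
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    candidates: List[str],
    reward_mean_and_std: Callable[[str, str], Tuple[float, float]],
    penalty: float = 1.0,  # assumed uncertainty-penalty coefficient
) -> str:
    """Return the candidate with the highest uncertainty-penalized reward."""
    best_response, best_score = candidates[0], float("-inf")
    for response in candidates:
        mean, std = reward_mean_and_std(prompt, response)
        # Penalize responses the reward model is uncertain about, which
        # tend to lie far from the preference-training distribution and
        # are the ones most prone to reward hacking.
        score = mean - penalty * std
        if score > best_score:
            best_response, best_score = response, score
    return best_response
```

A plain BoN baseline corresponds to `penalty = 0.0`; increasing the penalty trades raw reward for robustness to out-of-distribution responses.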

Cite

Text

Yang et al. "Bayesian Reward Models for LLM Alignment." ICML 2024 Workshops: SPIGM, 2024.

Markdown

[Yang et al. "Bayesian Reward Models for LLM Alignment." ICML 2024 Workshops: SPIGM, 2024.](https://mlanthology.org/icmlw/2024/yang2024icmlw-bayesian/)

BibTeX

@inproceedings{yang2024icmlw-bayesian,
  title     = {{Bayesian Reward Models for LLM Alignment}},
  author    = {Yang, Adam X. and Robeyns, Maxime and Coste, Thomas and Shi, Zhengyan and Wang, Jun and Ammar, Haitham Bou and Aitchison, Laurence},
  booktitle = {ICML 2024 Workshops: SPIGM},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/yang2024icmlw-bayesian/}
}