Failure Modes of Learning Reward Models for LLMs and Other Sequence Models

Abstract

To align large language models (LLMs) and other sequence-based models with human values, we typically assume that human preferences can be well represented by a "reward model". We infer the parameters of this reward model from data, and then train our models to maximize reward. Effective alignment with this approach relies on a strong reward model, and reward modeling becomes increasingly important as the reach of deployed models grows. Yet in practice, we often assume the existence of a particular reward model without regard to its potential shortcomings. In this preliminary work, I survey several failure modes of learned reward models, which may be organized into three broad categories: model misspecification, ambiguous preferences, and reward misgeneralization. I also identify avenues for future work. I have likely missed relevant points and related work; I would greatly appreciate correspondence pointing these out.
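The pipeline summarized above (assume preferences arise from a latent reward model, fit its parameters to comparison data, then train the policy to maximize the learned reward) is most commonly instantiated with a Bradley-Terry model over pairwise preferences, where P(A preferred to B) = sigmoid(r(A) - r(B)). The sketch below illustrates only that standard fitting step; it is not the paper's code, and names such as RewardModel and preference_loss are illustrative assumptions.

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: maps a pooled sequence embedding to a scalar reward."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, embed_dim) representation of a prompt-response pair
        return self.score(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry negative log-likelihood: P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
    return -nn.functional.logsigmoid(r_chosen - r_rejected).mean()

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in "preference data": random embeddings of chosen vs. rejected responses.
chosen, rejected = torch.randn(32, 128), torch.randn(32, 128)

for step in range(100):
    loss = preference_loss(model(chosen), model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(f"final preference loss: {loss.item():.3f}")

Note that this Bradley-Terry form already encodes strong assumptions, for instance that preference probabilities depend only on a difference of scalar scores; taking such a choice for granted is exactly the kind of unexamined modeling assumption the abstract cautions against.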

Cite

Text

Pitis. "Failure Modes of Learning Reward Models for LLMs and Other Sequence Models." ICML 2023 Workshops: MFPL, 2023.

Markdown

[Pitis. "Failure Modes of Learning Reward Models for LLMs and Other Sequence Models." ICML 2023 Workshops: MFPL, 2023.](https://mlanthology.org/icmlw/2023/pitis2023icmlw-failure/)

BibTeX

@inproceedings{pitis2023icmlw-failure,
  title     = {{Failure Modes of Learning Reward Models for LLMs and Other Sequence Models}},
  author    = {Pitis, Silviu},
  booktitle = {ICML 2023 Workshops: MFPL},
  year      = {2023},
  url       = {https://mlanthology.org/icmlw/2023/pitis2023icmlw-failure/}
}