Analyzing Reward Functions via Trajectory Alignment
Abstract
Reward design in reinforcement learning (RL) is often overlooked, with the assumption that a well-defined reward is readily available. However, reward functions can be challenging to design and prone to reward hacking, potentially leading to unintended or dangerous consequences in real-world applications. To create safe RL agents, reward alignment is crucial. We define reward alignment as the process of designing reward functions that preserve the preferences of a human stakeholder. In practice, reward functions are designed with training performance as the primary measure of success; this measure, however, may not reflect alignment. This work studies the practical implications of reward design on alignment. Specifically, we (1) propose a reward alignment metric, the Trajectory Alignment coefficient, that measures the similarity between the preference orderings of a human stakeholder and the preference orderings induced by a reward function, (2) use this metric to quantify the prevalence and extent of misalignment in human-designed reward functions, and (3) examine how misalignment affects the efficacy of these human-designed reward functions in terms of training performance.
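The abstract does not give the formula for the Trajectory Alignment coefficient. As a rough illustration of the underlying idea only, the sketch below assumes the coefficient behaves like a Kendall-tau-style rank agreement between a human's preference ordering over trajectories and the ordering induced by the returns of a designed reward function. All names here (trajectory_return, alignment_coefficient, human_scores) are hypothetical and are not taken from the paper.

# Hedged sketch: one plausible way to score trajectory alignment.
# Assumption: the metric is a Kendall-tau-style agreement over trajectory pairs,
# comparing a human's scalar preference scores with returns under reward_fn.
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

Trajectory = List[Tuple[object, object]]  # sequence of (state, action) pairs

def trajectory_return(trajectory: Trajectory,
                      reward_fn: Callable[[object, object], float]) -> float:
    """Sum the designed reward over a trajectory (undiscounted, for simplicity)."""
    return sum(reward_fn(s, a) for s, a in trajectory)

def alignment_coefficient(trajectories: Sequence[Trajectory],
                          human_scores: Sequence[float],
                          reward_fn: Callable[[object, object], float]) -> float:
    """Agreement in [-1, 1] between the human's ordering of trajectories
    (given as scalar scores) and the ordering induced by reward_fn.
    +1 means the reward function ranks every compared pair as the human does."""
    returns = [trajectory_return(t, reward_fn) for t in trajectories]
    concordant = discordant = 0
    for i, j in combinations(range(len(trajectories)), 2):
        human_diff = human_scores[i] - human_scores[j]
        reward_diff = returns[i] - returns[j]
        if human_diff == 0 or reward_diff == 0:
            continue  # skip ties; the paper may treat ties differently
        if (human_diff > 0) == (reward_diff > 0):
            concordant += 1
        else:
            discordant += 1
    compared = concordant + discordant
    return (concordant - discordant) / compared if compared else 0.0

Under this assumed definition, a perfectly aligned reward function scores +1, a reward function that reverses every human preference scores -1, and a reward function whose rankings are unrelated to the human's sits near 0.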
Cite
Text
Muslimani et al. "Analyzing Reward Functions via Trajectory Alignment." NeurIPS 2024 Workshops: Behavioral_ML, 2024.

Markdown

[Muslimani et al. "Analyzing Reward Functions via Trajectory Alignment." NeurIPS 2024 Workshops: Behavioral_ML, 2024.](https://mlanthology.org/neuripsw/2024/muslimani2024neuripsw-analyzing/)

BibTeX
@inproceedings{muslimani2024neuripsw-analyzing,
title = {{Analyzing Reward Functions via Trajectory Alignment}},
author = {Muslimani, Calarina and Chandramouli, Suyog and Booth, Serena and Knox, W. Bradley and Taylor, Matthew E.},
booktitle = {NeurIPS 2024 Workshops: Behavioral_ML},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/muslimani2024neuripsw-analyzing/}
}