Pareto-Optimal Learning from Preferences with Hidden Context

Abstract

Ensuring AI models align with human values is essential for their safety and functionality. Reinforcement learning from human feedback (RLHF) uses human preferences to achieve this alignment. However, preferences sourced from diverse populations can result in point estimates of human values that may be sub-optimal or unfair to specific groups. We propose Pareto Optimal Preference Learning (POPL), which frames discrepant group preferences as objectives with potential trade-offs, aiming for policies that are Pareto-optimal on the preference dataset. POPL utilizes lexicase selection, an iterative process that selects diverse, Pareto-optimal solutions. Our empirical evaluations demonstrate that POPL surpasses baseline methods in learning sets of reward functions, effectively catering to distinct groups without access to the number of groups or group membership labels.
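To make the selection step concrete, below is a minimal sketch of standard lexicase selection, which the abstract names as POPL's mechanism for picking diverse, Pareto-optimal hypotheses. The mapping of "cases" to individual preference comparisons and "candidates" to reward-function hypotheses is an illustrative assumption here, not taken from the paper; the function and variable names (lexicase_select, score, etc.) are likewise hypothetical.

import random

def lexicase_select(candidates, cases, score):
    # Select one candidate by filtering the pool on test cases in a random order.
    # candidates: list of hypotheses (e.g., reward functions) -- assumed mapping.
    # cases: list of test cases (e.g., individual preference comparisons) -- assumed mapping.
    # score(candidate, case) -> float, higher is better
    #   (e.g., 1.0 if the candidate agrees with the preference, 0.0 otherwise).
    pool = list(candidates)
    for case in random.sample(cases, len(cases)):  # fresh random case ordering each call
        if len(pool) == 1:
            break
        best = max(score(c, case) for c in pool)
        # Keep only candidates that tie for the best score on this case
        # (exact equality is fine for discrete agreement scores).
        pool = [c for c in pool if score(c, case) == best]
    return random.choice(pool)

# Example usage (hypothetical names): repeated calls with different random case
# orderings yield a diverse set of selections, none of which is strictly
# dominated on the case set.
# selected = [lexicase_select(hypotheses, preferences, agreement) for _ in range(100)]

Because each call shuffles the cases, different calls emphasize different preference comparisons first, which is what produces a spread of mutually non-dominated hypotheses rather than a single point estimate.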

Cite

Text

Boldi et al. "Pareto-Optimal Learning from Preferences with Hidden Context." NeurIPS 2024 Workshops: Pluralistic-Alignment, 2024.

Markdown

[Boldi et al. "Pareto-Optimal Learning from Preferences with Hidden Context." NeurIPS 2024 Workshops: Pluralistic-Alignment, 2024.](https://mlanthology.org/neuripsw/2024/boldi2024neuripsw-paretooptimal/)

BibTeX

@inproceedings{boldi2024neuripsw-paretooptimal,
  title     = {{Pareto-Optimal Learning from Preferences with Hidden Context}},
  author    = {Boldi, Ryan and Ding, Li and Spector, Lee and Niekum, Scott},
  booktitle = {NeurIPS 2024 Workshops: Pluralistic-Alignment},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/boldi2024neuripsw-paretooptimal/}
}