Group Robust Best-of-K Decoding of Language Models for Pluralistic Alignment

Abstract

The desirable behaviour of a chat agent can be described with multiple criteria, such as harmlessness, helpfulness, and conciseness, each of which can be scored by a reward model. In pluralistic alignment settings, each user or group of users may attach different significance to each criterion, yet in many practical scenarios it is difficult to know how much an individual user or group would weigh one criterion over another. Instead of assuming knowledge of the weights among multiple criteria, we propose a robust alignment approach that maximises the worst-case criterion among the group of reward models. To test this approach, we use best-of-K rejection sampling to demonstrate the properties of an algorithm that employs our robust objective. Finally, we propose several avenues of future exploration that may lead to more practical algorithms than group robust best-of-K rejection sampling.
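
The selection rule described in the abstract can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `select_group_robust`, `group_robust_best_of_k`, `sample_fn`, and `reward_fns` are hypothetical, each reward model is assumed to be a callable taking a prompt and a response, and the rewards are assumed to be on comparable scales (the abstract does not say how scores are normalised).

```python
from typing import Callable, Sequence

# Hypothetical signature: a reward model maps (prompt, response) to a score.
RewardFn = Callable[[str, str], float]


def select_group_robust(
    prompt: str,
    candidates: Sequence[str],
    reward_fns: Sequence[RewardFn],
) -> str:
    """Return the candidate whose worst-case reward across all reward models is highest."""
    if not candidates or not reward_fns:
        raise ValueError("need at least one candidate and one reward model")

    def worst_case(response: str) -> float:
        # Score the response under every criterion and keep the lowest score.
        return min(reward(prompt, response) for reward in reward_fns)

    # Max-min selection; ties go to the earliest candidate.
    return max(candidates, key=worst_case)


def group_robust_best_of_k(
    prompt: str,
    sample_fn: Callable[[str], str],
    reward_fns: Sequence[RewardFn],
    k: int,
) -> str:
    """Draw k responses with sample_fn (assumed to wrap a language model), then select max-min."""
    candidates = [sample_fn(prompt) for _ in range(k)]
    return select_group_robust(prompt, candidates, reward_fns)
```

With a single reward model this reduces to ordinary best-of-K selection. With several, a response that scores very highly on one criterion but poorly on another loses to a response that is acceptable on all of them.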

Cite

Text

Yoon et al. "Group Robust Best-of-K Decoding of Language Models for Pluralistic Alignment." NeurIPS 2024 Workshops: Pluralistic-Alignment, 2024.

Markdown

[Yoon et al. "Group Robust Best-of-K Decoding of Language Models for Pluralistic Alignment." NeurIPS 2024 Workshops: Pluralistic-Alignment, 2024.](https://mlanthology.org/neuripsw/2024/yoon2024neuripsw-group/)

BibTeX

@inproceedings{yoon2024neuripsw-group,
  title     = {{Group Robust Best-of-K Decoding of Language Models for Pluralistic Alignment}},
  author    = {Yoon, Sangwoong and Bankes, William and Son, Seongho and Petrovic, Anja and Ramesh, Shyam Sundhar and Tang, Xiaohang and Bogunovic, Ilija},
  booktitle = {NeurIPS 2024 Workshops: Pluralistic-Alignment},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/yoon2024neuripsw-group/}
}