Reward Model Aggregation

Abstract

Aligning language models requires steering outputs toward desired properties using reward models. This paper tackles the challenge of combining multiple reward models that capture diverse objectives. We introduce methods for aggregating these rewards using logical operations. Experiments show that our methods outperform traditional aggregation techniques and underscore the importance of choosing proper reference values.
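To make the idea of logic-based aggregation concrete, here is a minimal sketch of one plausible instantiation: each per-objective reward is centered by a reference value and the centered rewards are combined with a soft logical AND (log-sigmoid sum) or a soft logical OR (smooth maximum). The function names `soft_and` and `soft_or`, the specific transforms, and the reference values of zero are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def log_sigmoid(x: np.ndarray) -> np.ndarray:
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def soft_and(rewards: np.ndarray, references: np.ndarray) -> float:
    # Soft logical AND: sum of log-sigmoids of reference-centered rewards.
    # The aggregate is high only when every objective clears its reference.
    return float(np.sum(log_sigmoid(rewards - references)))

def soft_or(rewards: np.ndarray, references: np.ndarray) -> float:
    # Soft logical OR via a smooth maximum (log-sum-exp) of centered rewards:
    # high as soon as at least one objective clears its reference.
    return float(np.logaddexp.reduce(rewards - references))

# Example: two reward models (e.g. helpfulness and harmlessness) scoring one response.
rewards = np.array([1.2, -0.3])
references = np.array([0.0, 0.0])  # reference values act as "good enough" thresholds

print(soft_and(rewards, references))  # penalized: the second objective misses its reference
print(soft_or(rewards, references))   # tolerated: one objective clears its reference
```

In this sketch the reference values determine where each reward's "pass/fail" boundary sits, which is why the choice of reference matters so much for the aggregate score.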

Cite

Text

Wang et al. "Reward Model Aggregation." NeurIPS 2023 Workshops: Instruction, 2023.

Markdown

[Wang et al. "Reward Model Aggregation." NeurIPS 2023 Workshops: Instruction, 2023.](https://mlanthology.org/neuripsw/2023/wang2023neuripsw-reward/)

BibTeX

@inproceedings{wang2023neuripsw-reward,
  title     = {{Reward Model Aggregation}},
  author    = {Wang, Zihao and Nagpal, Chirag and D'Amour, Alexander and Veitch, Victor and Koyejo, Sanmi},
  booktitle = {NeurIPS 2023 Workshops: Instruction},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/wang2023neuripsw-reward/}
}