Diverging Preferences: When Do Annotators Disagree and Do Models Know?
Abstract
We examine diverging preferences in human-labeled preference datasets. We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes. We find that the majority of disagreements stand in opposition to standard reward modeling approaches, which are designed under the assumption that annotator disagreement is noise. We then explore how these findings impact reward modeling. In our experiments, we demonstrate that standard reward modeling methods, such as the Bradley-Terry model, fail to differentiate whether a given preference judgment reflects unanimous agreement among annotators or merely the majority opinion among diverging user preferences.
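To make the last point concrete, below is a minimal sketch (not the paper's implementation) of the standard Bradley-Terry pairwise loss used in reward modeling. The annotator splits (5-0 unanimous vs. 3-2 diverging), the response pair, and the reward scores are illustrative assumptions; the point is that majority-vote aggregation collapses both scenarios to the same training label, so the loss the reward model sees is identical.

import math

def bradley_terry_nll(reward_chosen: float, reward_rejected: float) -> float:
    # Standard Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

# Two hypothetical annotation scenarios for the same response pair (A, B):
#   unanimous: 5 of 5 annotators prefer A over B
#   diverging: 3 of 5 annotators prefer A, 2 prefer B
# After majority-vote aggregation both collapse to the single label "A > B",
# so the reward model receives the same supervision in either case.
unanimous_label = ("A", "B")   # (chosen, rejected)
diverging_label = ("A", "B")   # same label despite genuine disagreement

r_a, r_b = 1.2, 0.4  # illustrative reward scores from some reward model
print(bradley_terry_nll(r_a, r_b))  # identical loss regardless of underlying agreement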
Cite
Text
Zhang et al. "Diverging Preferences: When Do Annotators Disagree and Do Models Know?" NeurIPS 2024 Workshops: Pluralistic-Alignment, 2024.
Markdown
[Zhang et al. "Diverging Preferences: When Do Annotators Disagree and Do Models Know?" NeurIPS 2024 Workshops: Pluralistic-Alignment, 2024.](https://mlanthology.org/neuripsw/2024/zhang2024neuripsw-diverging/)
BibTeX
@inproceedings{zhang2024neuripsw-diverging,
title = {{Diverging Preferences: When Do Annotators Disagree and Do Models Know?}},
author = {Zhang, Michael JQ and Wang, Zhilin and Hwang, Jena D. and Dong, Yi and Delalleau, Olivier and Choi, Yejin and Choi, Eunsol and Ren, Xiang and Pyatkin, Valentina},
booktitle = {NeurIPS 2024 Workshops: Pluralistic-Alignment},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/zhang2024neuripsw-diverging/}
}