Reward Modeling with Ordinal Feedback: Wisdom of the Crowd
Abstract
The canonical setup of learning a reward model (RM) from human preferences with binary feedback discards potentially useful samples (such as "tied" responses) and loses fine-grained information (such as "slightly better"). This paper proposes a framework for learning RMs under ordinal feedback, which generalizes binary feedback to arbitrary granularity. We first identify a marginal unbiasedness condition that generalizes the existing assumption underlying binary feedback; the condition is validated via the sociological concept of the "wisdom of the crowd". Under this condition, we develop a natural probability model and prove the benefits of fine-grained feedback in terms of reducing the Rademacher complexity, a result that may be of independent interest for another problem: the bias-variance trade-off in knowledge distillation. The framework also sheds light on the design of guidelines for human annotators. Our numerical experiments validate that (1) fine-grained feedback leads to better RM learning in both in-distribution and out-of-distribution settings, and (2) incorporating a certain proportion of tied samples boosts RM learning.
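As a concrete illustration of this setup (a sketch only; the symbols $r_\theta$, $\sigma$, and the label set below are illustrative assumptions and may differ from the paper's exact probability model), the ordinal-feedback objective can be viewed as a soft-label version of the Bradley-Terry cross-entropy loss:

\[
  \ell(\theta) \;=\; -\,\mathbb{E}_{(x,\,y_1,\,y_2,\,z)}\Bigl[\, z \,\log \sigma\bigl(r_\theta(x, y_1) - r_\theta(x, y_2)\bigr) \;+\; (1 - z)\,\log \sigma\bigl(r_\theta(x, y_2) - r_\theta(x, y_1)\bigr) \Bigr],
\]

where $\sigma$ is the logistic function and the ordinal label $z$ takes values in, for example, $\{0, \tfrac14, \tfrac12, \tfrac34, 1\}$, encoding "worse", "slightly worse", "tied", "slightly better", "better". Binary feedback is recovered with $z \in \{0, 1\}$, and a tied sample corresponds to $z = \tfrac12$. Under this reading, a marginal unbiasedness condition would require $\mathbb{E}[z \mid x, y_1, y_2]$ to equal the true preference probability of $y_1$ over $y_2$.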
Cite
Text
Liu et al. "Reward Modeling with Ordinal Feedback: Wisdom of the Crowd." Proceedings of the 42nd International Conference on Machine Learning, 2025.
Markdown
[Liu et al. "Reward Modeling with Ordinal Feedback: Wisdom of the Crowd." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/liu2025icml-reward/)
BibTeX
@inproceedings{liu2025icml-reward,
title = {{Reward Modeling with Ordinal Feedback: Wisdom of the Crowd}},
author = {Liu, Shang and Pan, Yu and Chen, Guanting and Li, Xiaocheng},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {39190--39218},
volume = {267},
url = {https://mlanthology.org/icml/2025/liu2025icml-reward/}
}