S2L-RM: Short-to-Long Reward Modeling
Abstract
Preference tuning has been effective in aligning language models with human values, often relying on reward models to annotate preferences over generated responses. However, extending this stage to long-context language models requires reward models capable of accurately evaluating responses to long-context tasks — a challenge that current models struggle to address despite their expanded context windows. We introduce S2L-RM, an approach that leverages short-context reward models to assess responses to long-context tasks. Our method employs a factual verifier to select responses within a trust region relative to a reference response. These responses are then evaluated using any short-context reward model, with input limited to a short query, the reference response, and the model-generated response. Our preliminary experiments demonstrate that our approach can accurately provide preference annotations in long-context scenarios.
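The sketch below illustrates how such a pipeline could be wired together: a factual verifier filters candidate responses into a trust region around the reference response, and a short-context reward model then ranks the survivors using only the short query, the reference, and each candidate. All function names, the `trust_margin` parameter, and the pair-selection heuristic are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of a short-to-long preference-annotation pipeline.
# Every callable and threshold here is an assumed placeholder.
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    text: str
    factuality: float = 0.0  # score from the factual verifier
    reward: float = 0.0      # score from the short-context reward model


def annotate_preferences(
    short_query: str,
    reference_response: str,
    candidates: List[str],
    verify_facts: Callable[[str, str], float],           # (reference, candidate) -> factuality score
    short_context_rm: Callable[[str, str, str], float],  # (short query, reference, candidate) -> reward
    trust_margin: float = 0.1,                           # assumed width of the trust region
) -> Optional[Tuple[str, str]]:
    """Return a (chosen, rejected) pair, or None if too few candidates survive."""
    # 1) Factual verification: keep only candidates whose factuality relative
    #    to the reference response lies within the trust region.
    reference_score = verify_facts(reference_response, reference_response)
    trusted: List[Candidate] = []
    for text in candidates:
        score = verify_facts(reference_response, text)
        if score >= reference_score - trust_margin:
            trusted.append(Candidate(text=text, factuality=score))

    if len(trusted) < 2:
        return None  # not enough trusted responses to form a preference pair

    # 2) Short-context reward modeling: the RM sees only the short query,
    #    the reference response, and the candidate -- no long context needed.
    for cand in trusted:
        cand.reward = short_context_rm(short_query, reference_response, cand.text)

    ranked = sorted(trusted, key=lambda c: c.reward, reverse=True)
    return ranked[0].text, ranked[-1].text
```

In this reading, the trust region keeps the downstream reward model from being asked to compare factually divergent responses, so its short-context judgments remain meaningful for the original long-context task.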
Cite
Text
Chen et al. "S2L-RM: Short-to-Long Reward Modeling." NeurIPS 2024 Workshops: LanGame, 2024.
Markdown
[Chen et al. "S2L-RM: Short-to-Long Reward Modeling." NeurIPS 2024 Workshops: LanGame, 2024.](https://mlanthology.org/neuripsw/2024/chen2024neuripsw-s2lrm/)
BibTeX
@inproceedings{chen2024neuripsw-s2lrm,
title = {{S2L-RM: Short-to-Long Reward Modeling}},
author = {Chen, Changyu and Liu, Zichen and Wang, Haonan and Du, Chao and Pang, Tianyu and Liu, Qian and Sinha, Arunesh and Varakantham, Pradeep and Lin, Min},
booktitle = {NeurIPS 2024 Workshops: LanGame},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/chen2024neuripsw-s2lrm/}
}