Generalizing Reward Modeling for Out-of-Distribution Preference Learning

Abstract

Preference learning (PL) with large language models (LLMs) aims to align the LLMs’ generations with human preferences. Previous work on reinforcement learning from human feedback (RLHF) has demonstrated promising results for in-distribution PL. However, because human feedback is difficult to obtain, training a separate reward model (RM) for every encountered distribution is impractical. Out-of-distribution (OOD) PL is therefore useful for enhancing LLMs’ generalization ability with limited preference feedback. This work addresses OOD PL by optimizing a general RM through a meta-learning approach. A bilevel optimization algorithm is used during meta-training to learn an RM that guides policy learning to align with human preferences across various distributions. When encountering a test distribution, the meta-test procedure optimizes a regularized policy using the learned RM for PL. We theoretically establish the convergence rate of the bilevel optimization algorithm under reasonable assumptions. Additionally, we conduct experiments on two text generation tasks across 22 held-out data distributions, outperforming strong baselines on a range of evaluation metrics.
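The bilevel scheme described above can be illustrated with a toy sketch: an inner step adapts a policy toward whatever the current reward model (RM) scores highly, and an outer step updates a single shared RM on Bradley-Terry preference pairs drawn from several training distributions. This is a hypothetical first-order simplification for illustration only (the outer gradient here ignores its dependence on the inner policy step); it is not the paper's algorithm, and all function names, the linear feature representation, and the learning rates are assumptions.

```python
import math
import random

random.seed(0)

DIM = 4  # toy feature dimension for (prompt, response) pairs


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bt_loss_grad(w, x_pos, x_neg):
    """Gradient of the Bradley-Terry preference loss
    -log sigmoid(r(x_pos) - r(x_neg)) w.r.t. the RM weights w."""
    p = sigmoid(dot(w, x_pos) - dot(w, x_neg))
    return [-(1.0 - p) * (xp - xn) for xp, xn in zip(x_pos, x_neg)]


def inner_policy_step(w_rm, theta, features, lr=0.1):
    """Inner level: one policy-gradient step that moves the softmax
    policy weights theta toward responses the current RM scores highly
    (a stand-in for RL fine-tuning against the reward model)."""
    scores = [dot(theta, x) for x in features]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    rewards = [dot(w_rm, x) for x in features]
    baseline = sum(p * r for p, r in zip(probs, rewards))
    mean_feat = [sum(q * y[j] for q, y in zip(probs, features)) for j in range(DIM)]
    grad = [0.0] * DIM
    for p, r, x in zip(probs, rewards, features):
        adv = r - baseline  # advantage over the policy's average reward
        for j in range(DIM):
            grad[j] += p * adv * (x[j] - mean_feat[j])
    return [t + lr * g for t, g in zip(theta, grad)]


def meta_train(tasks, epochs=50, lr_rm=0.5):
    """Outer level: update one shared RM so that, after the inner policy
    step on each task (distribution), preference pairs are still ranked
    correctly. Each task is (preference_pairs, candidate_features)."""
    w_rm = [0.0] * DIM
    for _ in range(epochs):
        for prefs, candidates in tasks:
            theta = inner_policy_step(w_rm, [0.0] * DIM, candidates)
            for x_pos, x_neg in prefs:  # outer update on preference pairs
                g = bt_loss_grad(w_rm, x_pos, x_neg)
                w_rm = [wi - lr_rm * gi for wi, gi in zip(w_rm, g)]
    return w_rm
```

On toy tasks where one feature dimension marks the human-preferred response, the shared RM learned by `meta_train` ranks preferred responses above dispreferred ones even for feature vectors it never saw, which is the sense in which a single general RM can transfer across distributions.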

Cite

Text

Jia. "Generalizing Reward Modeling for Out-of-Distribution Preference Learning." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2024. doi:10.1007/978-3-031-70362-1_7

Markdown

[Jia. "Generalizing Reward Modeling for Out-of-Distribution Preference Learning." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2024.](https://mlanthology.org/ecmlpkdd/2024/jia2024ecmlpkdd-generalizing/) doi:10.1007/978-3-031-70362-1_7

BibTeX

@inproceedings{jia2024ecmlpkdd-generalizing,
  title     = {{Generalizing Reward Modeling for Out-of-Distribution Preference Learning}},
  author    = {Jia, Chen},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2024},
  pages     = {107--124},
  doi       = {10.1007/978-3-031-70362-1_7},
  url       = {https://mlanthology.org/ecmlpkdd/2024/jia2024ecmlpkdd-generalizing/}
}