Alleviating Shifted Distribution in Human Preference Alignment Through Meta-Learning

Abstract

The capability of the reward model (RM) is crucial for the success of Reinforcement Learning from Human Feedback (RLHF) in aligning with human preferences. However, as training progresses, the output distribution of the policy model shifts. The RM, initially trained on responses sampled from the output distribution of the early policy model, gradually loses its ability to distinguish between responses drawn from the newly shifted distribution. This issue is further compounded when the RM, trained on a specific data distribution, struggles to generalize to examples outside of that distribution. These two issues can be unified as a single challenge posed by the shifted distribution of the environment. To surmount this challenge, we introduce MetaRM, a novel method that leverages meta-learning to adapt the RM to the shifted environment distribution. MetaRM optimizes the RM in an alternating way, both preserving the preferences encoded in the original preference pairs and maximizing its discrimination power over new examples from the shifted distribution. Extensive experiments demonstrate that MetaRM iteratively enhances human preference alignment by improving the RM's capacity to identify subtle differences among samples from the shifted distribution.
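
The alternating, meta-learning-style update described in the abstract can be pictured as a MAML-like reward-model step: a virtual inner update that sharpens the RM's discrimination on samples from the shifted policy distribution, followed by an outer update that enforces the original preference pairs at the adapted parameters. The sketch below is only a hypothetical illustration of that idea, not the paper's exact algorithm; the variance-based discrimination objective and all function and variable names (preference_loss, discrimination_loss, metarm_step, inner_lr) are assumptions introduced for illustration.

import torch
import torch.nn.functional as F
from torch.func import functional_call

def preference_loss(rewards_chosen, rewards_rejected):
    # Standard Bradley-Terry pairwise loss on the original preference pairs.
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()

def discrimination_loss(rewards_shifted):
    # One plausible reading of "maximizing discrimination power" (an assumption):
    # spread rewards on shifted-distribution samples apart by minimizing their
    # negative variance.
    return -rewards_shifted.var()

def metarm_step(rm, optimizer, chosen, rejected, shifted, inner_lr=1e-5):
    # `rm` is assumed to map a batch of encoded (prompt, response) tensors to
    # scalar rewards; `chosen`/`rejected` come from the original preference
    # data, `shifted` from the current (shifted) policy output distribution.
    params = {name: p for name, p in rm.named_parameters()}

    # Inner step: virtually adapt the RM toward the shifted distribution.
    inner = discrimination_loss(functional_call(rm, params, (shifted,)))
    grads = torch.autograd.grad(inner, tuple(params.values()), create_graph=True)
    adapted = {name: p - inner_lr * g
               for (name, p), g in zip(params.items(), grads)}

    # Outer step: evaluate the original preference loss at the adapted point
    # and apply the resulting gradient to the original parameters.
    outer = preference_loss(functional_call(rm, adapted, (chosen,)),
                            functional_call(rm, adapted, (rejected,)))
    optimizer.zero_grad()
    outer.backward()
    optimizer.step()
    return inner.item(), outer.item()

Because the inner gradient is taken with create_graph=True, the outer preference loss differentiates through the virtual adaptation, so the update both preserves the original preferences and accounts for how the RM would behave after adapting to the shifted distribution.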

Cite

Text

Dou et al. "Alleviating Shifted Distribution in Human Preference Alignment Through Meta-Learning." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I22.34552

Markdown

[Dou et al. "Alleviating Shifted Distribution in Human Preference Alignment Through Meta-Learning." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/dou2025aaai-alleviating/) doi:10.1609/AAAI.V39I22.34552

BibTeX

@inproceedings{dou2025aaai-alleviating,
  title     = {{Alleviating Shifted Distribution in Human Preference Alignment Through Meta-Learning}},
  author    = {Dou, Shihan and Liu, Yan and Zhou, Enyu and Gao, Songyang and Li, Tianlong and Xiong, Limao and Zhao, Xin and Jia, Haoxiang and Ye, Junjie and Zheng, Rui and Gui, Tao and Zhang, Qi and Huang, Xuanjing},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {23805--23813},
  doi       = {10.1609/AAAI.V39I22.34552},
  url       = {https://mlanthology.org/aaai/2025/dou2025aaai-alleviating/}
}