Personalized Language Modeling from Personalized Human Feedback

Abstract

Reinforcement Learning from Human Feedback (RLHF) is the currently dominant framework for fine-tuning large language models to better align with human preferences. However, the premise underlying algorithms developed in this framework, namely that feedback from different users can be captured by a single preference model, becomes problematic when the preferences encoded in human feedback are diverse. In this work, we address this problem by developing methods for building personalized language models. We propose a general Personalized-RLHF (P-RLHF) framework, in which a user model is learned jointly with the language (or reward) model, and we develop new learning objectives for personalized reward modeling and personalized Direct Preference Optimization. To demonstrate the efficacy of our approach, we evaluate it on real-world text summarization data with annotated preferences and annotator information. Fine-tuning GPT-J 6B, we obtain personalized language (and reward) models that outperform their non-personalized counterparts at aligning with individual preferences.
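
To make the setup concrete, the sketch below shows one simple way such a framework could be instantiated: a learned per-user embedding (the user model) is prepended to the token embeddings of a causal language model, and a standard DPO-style loss is applied to the resulting user-conditioned log-probabilities. This is an illustrative sketch under our own assumptions, not the paper's implementation; the names (UserConditionedPolicy, dpo_loss), the "soft user token" design, and the use of Hugging Face's inputs_embeds argument are all assumptions made for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Standard DPO logistic loss on the implicit reward margin between the
    # (here, user-conditioned) policy and the frozen reference model.
    logits = beta * ((policy_chosen_logps - policy_rejected_logps)
                     - (ref_chosen_logps - ref_rejected_logps))
    return -F.logsigmoid(logits).mean()

class UserConditionedPolicy(nn.Module):
    # Wraps a causal LM and prepends a learned per-user embedding
    # (a "soft user token") to the input token embeddings, so the same
    # backbone produces user-specific next-token distributions.
    def __init__(self, lm, num_users, embed_dim):
        super().__init__()
        self.lm = lm                                    # e.g., a causal LM backbone
        self.user_embedding = nn.Embedding(num_users, embed_dim)

    def forward(self, user_ids, token_embeds):
        # user_ids: (batch,)  token_embeds: (batch, seq_len, embed_dim)
        u = self.user_embedding(user_ids).unsqueeze(1)  # (batch, 1, embed_dim)
        return self.lm(inputs_embeds=torch.cat([u, token_embeds], dim=1))

In this sketch, the chosen and rejected responses from a given annotator would both be scored by the same user-conditioned policy, while the reference model remains unconditioned; other choices of user model are equally compatible with the general framework.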

Cite

Text

Li et al. "Personalized Language Modeling from Personalized Human Feedback." ICLR 2024 Workshops: R2-FM, 2024.

Markdown

[Li et al. "Personalized Language Modeling from Personalized Human Feedback." ICLR 2024 Workshops: R2-FM, 2024.](https://mlanthology.org/iclrw/2024/li2024iclrw-personalized/)

BibTeX

@inproceedings{li2024iclrw-personalized,
  title     = {{Personalized Language Modeling from Personalized Human Feedback}},
  author    = {Li, Xinyu and Lipton, Zachary Chase and Leqi, Liu},
  booktitle = {ICLR 2024 Workshops: R2-FM},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/li2024iclrw-personalized/}
}