Towards Understanding Sycophancy in Language Models
Abstract
Reinforcement learning from human feedback (RLHF) is a popular technique for training high-quality AI assistants. However, RLHF may also encourage model responses that match user beliefs over truthful ones, a behavior known as sycophancy. We investigate the prevalence of sycophancy in RLHF-trained models and whether human preference judgments are responsible. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. To understand whether human preferences drive this broadly observed behavior of RLHF models, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of RLHF models, likely driven in part by human preference judgments favoring sycophantic responses.
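The abstract refers to optimizing model outputs against a preference model (PM). Below is a minimal sketch of best-of-N sampling against a PM, one standard way to perform such optimization; the `generate` and `pm_score` callables and the toy scoring rule are illustrative placeholders, not the paper's actual models, prompts, or data.

```python
# Minimal sketch: best-of-N sampling against a preference model (PM).
# A PM that favors agreeable responses will tend to select sycophantic
# outputs, illustrating how PM optimization can trade off truthfulness.
from typing import Callable, List


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],         # samples one response from the policy
    pm_score: Callable[[str, str], float],  # PM score for a (prompt, response) pair
    n: int = 16,
) -> str:
    """Sample n candidate responses and return the one the PM scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: pm_score(prompt, response))


if __name__ == "__main__":
    # Stub policy and PM for illustration only (hypothetical, not from the paper).
    responses = iter([
        "I agree with you completely!",
        "Actually, the available evidence says otherwise.",
    ])
    chosen = best_of_n(
        prompt="I think the earth is flat, right?",
        generate=lambda p: next(responses),
        # A sycophancy-favoring PM: scores agreement higher than correction.
        pm_score=lambda p, r: 1.0 if "agree" in r else 0.5,
        n=2,
    )
    print(chosen)  # prints the sycophantic response
```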
Cite
Text
Sharma et al. "Towards Understanding Sycophancy in Language Models." International Conference on Learning Representations, 2024.
BibTeX
@inproceedings{sharma2024iclr-understanding,
  title     = {{Towards Understanding Sycophancy in Language Models}},
  author    = {Sharma, Mrinank and Tong, Meg and Korbak, Tomasz and Duvenaud, David and Askell, Amanda and Bowman, Samuel R. and Durmus, Esin and Hatfield-Dodds, Zac and Johnston, Scott R. and Kravec, Shauna M. and Maxwell, Timothy and McCandlish, Sam and Ndousse, Kamal and Rausch, Oliver and Schiefer, Nicholas and Yan, Da and Zhang, Miranda and Perez, Ethan},
  booktitle = {International Conference on Learning Representations},
  year      = {2024},
  url       = {https://mlanthology.org/iclr/2024/sharma2024iclr-understanding/}
}