Towards Aligning Language Models with Textual Feedback
Abstract
We present ALT (ALignment with Textual feedback), an approach that aligns models toward user preferences expressed in text. We posit that text offers users a richer feedback interface than comparative preferences. In our work, we explore the efficacy and efficiency of textual feedback across several tasks. For the task of reducing model toxicity, we show that even rule-based feedback can reduce model toxicity 62% more than PPO in-domain and 52% out-of-domain. For the task of summarization, we show that ALT can match the performance of PPO with only 20% of the training samples, both in- and out-of-domain. Finally, for the task of aligning dialog to be harmless and helpful, we find that ALT can effectively use textual feedback provided by a large language model without the need for a reward model.
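To make the core idea concrete, below is a minimal sketch of aligning a model with textual feedback by conditioning on it: the feedback string is prepended to each prompt during fine-tuning, and at inference time generation is steered by requesting the desired feedback. This is an illustration under our own assumptions, not the paper's training procedure; the model name, feedback strings, and data triples are placeholders.

```python
# Sketch: condition a causal LM on textual feedback by prepending it to the
# prompt, then fine-tune with the standard LM loss over the full sequence
# (masking the prefix tokens would be a possible refinement).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM works for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Illustrative (feedback, prompt, response) triples; the feedback text could
# come from rules (e.g. a toxicity classifier mapped to text) or from an LLM.
data = [
    ("feedback: non-toxic and polite", "User: you are useless",
     "I'm sorry you feel that way. How can I help?"),
    ("feedback: concise summary", "Summarize: The meeting covered Q3 goals...",
     "The team reviewed Q3 goals."),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for feedback, prompt, response in data:
    # The textual feedback acts as a conditioning prefix.
    text = f"{feedback}\n{prompt}\n{response}{tokenizer.eos_token}"
    batch = tokenizer(text, return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"])  # causal LM loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# At inference time, generation is steered by asking for the desired feedback.
model.eval()
query = tokenizer("feedback: non-toxic and polite\nUser: you are useless\n",
                  return_tensors="pt")
print(tokenizer.decode(model.generate(**query, max_new_tokens=30)[0]))
```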
Cite
Text
Lloret et al. "Towards Aligning Language Models with Textual Feedback." ICML 2024 Workshops: MFHAIA, 2024.
Markdown
[Lloret et al. "Towards Aligning Language Models with Textual Feedback." ICML 2024 Workshops: MFHAIA, 2024.](https://mlanthology.org/icmlw/2024/lloret2024icmlw-aligning/)
BibTeX
@inproceedings{lloret2024icmlw-aligning,
title = {{Towards Aligning Language Models with Textual Feedback}},
author = {Lloret, Saüc Abadal and Dhuliawala, Shehzaad and Murugesan, Keerthiram and Sachan, Mrinmaya},
booktitle = {ICML 2024 Workshops: MFHAIA},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/lloret2024icmlw-aligning/}
}