SAIL: Self-Improving Efficient Online Alignment of Large Language Models
Abstract
Reinforcement Learning from Human Feedback (RLHF) is a key method for aligning large language models (LLMs) with human preferences. Offline RLHF methods rely on fixed preference datasets, which can lead to sub-optimal performance, while existing online RLHF methods lack a unified conceptual formulation and suffer from distribution shift. We establish that online LLM alignment is underpinned by bilevel optimization. By reducing this formulation to an efficient single-level first-order method (using the reward-policy equivalence), our approach generates new samples and iteratively refines model alignment. Thus, we perform alignment in an online and self-improving manner and generalize prior online RLHF methods as special cases. We significantly improve alignment performance on open-source datasets with minimal computational overhead.
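As a rough illustration of the single-level reduction described above (a conceptual sketch, not the paper's actual implementation), the snippet below uses a DPO-style implicit reward, where the reward of a response is proportional to the log-probability ratio between the policy and a reference model, so the preference objective can be optimized with ordinary first-order updates and no separate reward model. The helper names in the commented loop (`sample`, `preference_oracle`) are placeholders for whatever generation and online-labeling components one assumes.

```python
import torch
import torch.nn.functional as F

def single_level_preference_loss(pi_logp_chosen, pi_logp_rejected,
                                 ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style surrogate: the implicit reward of a response is
    beta * (log pi - log pi_ref), so the preference loss depends only on
    policy and reference log-probabilities (no explicit reward model)."""
    margin = beta * ((pi_logp_chosen - ref_logp_chosen)
                     - (pi_logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()

# Skeleton of one online, self-improving iteration (helpers are hypothetical):
# 1. responses_a, responses_b = sample(policy, prompts)        # generate new samples
# 2. chosen, rejected = preference_oracle(prompts, responses)  # label them online
# 3. loss = single_level_preference_loss(...)                  # first-order update
# 4. loss.backward(); optimizer.step()

# Tiny smoke test with random log-probabilities.
torch.manual_seed(0)
pi_c, pi_r = torch.randn(4), torch.randn(4)
ref_c, ref_r = torch.randn(4), torch.randn(4)
print(single_level_preference_loss(pi_c, pi_r, ref_c, ref_r).item())
```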
Cite
Text
Ding et al. "SAIL: Self-Improving Efficient Online Alignment of Large Language Models." ICML 2024 Workshops: TF2M, 2024.

Markdown

[Ding et al. "SAIL: Self-Improving Efficient Online Alignment of Large Language Models." ICML 2024 Workshops: TF2M, 2024.](https://mlanthology.org/icmlw/2024/ding2024icmlw-sail/)

BibTeX
@inproceedings{ding2024icmlw-sail,
title = {{SAIL: Self-Improving Efficient Online Alignment of Large Language Models}},
author = {Ding, Mucong and Chakraborty, Souradip and Agrawal, Vibhu and Che, Zora and Koppel, Alec and Wang, Mengdi and Bedi, Amrit and Huang, Furong},
booktitle = {ICML 2024 Workshops: TF2M},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/ding2024icmlw-sail/}
}