SAIL: Self-Improving Efficient Online Alignment of Large Language Models

Abstract

Reinforcement Learning from Human Feedback (RLHF) is a key method for aligning large language models (LLMs) with human preferences. Offline RLHF methods rely on fixed preference datasets, which can lead to sub-optimal performance, while existing online RLHF methods lack a unified conceptual formulation and suffer from distribution shift. We establish that online LLM alignment is underpinned by bilevel optimization. By reducing this formulation to an efficient single-level first-order method (using the reward-policy equivalence), our approach generates new samples and iteratively refines model alignment. Thus, we perform alignment in an online, self-improving manner and recover prior online RLHF methods as special cases. We significantly improve alignment performance on open-source datasets with minimal computational overhead.
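To make the bilevel view concrete, the following is a minimal sketch in standard RLHF notation; it is an assumed illustration, not the paper's exact formulation (the objectives and notation in SAIL may differ). The inner level is the usual KL-regularized policy optimization against a reward r, the outer level fits r on preferences collected from the current policy, and the closed-form solution of the inner problem gives the reward-policy equivalence that enables the single-level reduction.

% Hypothetical bilevel online-RLHF sketch (notation assumed, not taken from the paper).
% Inner level: KL-regularized policy optimization for a given reward r.
\[
\pi_r^{*} \;=\; \arg\max_{\pi}\;
\mathbb{E}_{x \sim \mathcal{D}}\Big[
\mathbb{E}_{y \sim \pi(\cdot\mid x)}\big[r(x,y)\big]
\;-\; \beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)
\Big]
\]
% Outer level: fit the reward on preference pairs (y_w preferred over y_l)
% sampled from the current inner-optimal policy.
\[
\min_{r}\;
\mathbb{E}_{x \sim \mathcal{D},\;(y_w,\,y_l) \sim \pi_r^{*}(\cdot\mid x)}
\Big[-\log \sigma\big(r(x,y_w) - r(x,y_l)\big)\Big]
\]
% Reward-policy equivalence: the inner problem has the closed form
% pi_r^*(y|x) proportional to pi_ref(y|x) exp(r(x,y)/beta), equivalently
\[
r(x,y) \;=\; \beta \log \frac{\pi_r^{*}(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} \;+\; \beta \log Z(x)
\]
% Substituting this into the outer objective (the log Z(x) terms cancel in the
% pairwise difference) expresses everything in terms of the policy alone,
% collapsing the bilevel problem into a single-level, first-order update.

Under these assumptions, each online iteration would sample new responses from the current policy, obtain preference labels, and take a first-order step on the reduced objective, which is how the sketch connects to the self-improving loop described in the abstract.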

Cite

Text

Ding et al. "SAIL: Self-Improving Efficient Online Alignment of Large Language Models." ICML 2024 Workshops: TF2M, 2024.

Markdown

[Ding et al. "SAIL: Self-Improving Efficient Online Alignment of Large Language Models." ICML 2024 Workshops: TF2M, 2024.](https://mlanthology.org/icmlw/2024/ding2024icmlw-sail/)

BibTeX

@inproceedings{ding2024icmlw-sail,
  title     = {{SAIL: Self-Improving Efficient Online Alignment of Large Language Models}},
  author    = {Ding, Mucong and Chakraborty, Souradip and Agrawal, Vibhu and Che, Zora and Koppel, Alec and Wang, Mengdi and Bedi, Amrit and Huang, Furong},
  booktitle = {ICML 2024 Workshops: TF2M},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/ding2024icmlw-sail/}
}