Evolving Alignment via Asymmetric Self-Play

Abstract

Current RLHF approaches for aligning large language models (LLMs) typically assume a fixed prompt distribution, which is sub-optimal and limits the generalization capabilities of language models. To address this issue, we introduce a general framework that casts alignment as an asymmetric game between two players: (i) a creator, which strategically generates informative prompt distributions using reward signals, and (ii) a solver, which learns to produce preferred responses on prompts produced by the creator. This framework, Evolving Alignment via Asymmetric Self-Play (`eva`), results in a simple and efficient approach that can utilize any existing RLHF algorithm. `eva` achieves a new state of the art on widely adopted alignment benchmarks without the need for any additional human-crafted prompts. For example, it improves the win rate of finetuned gemma-2-9b-it on Arena-Hard from 51.6% to 60.1% with DPO, from 55.7% to 58.9% with SPPO, from 52.3% to 60.7% with SimPO, and from 54.8% to 60.3% with ORPO, surpassing its 27B version and matching Claude-3-opus. Finally, we show that `eva` is effective and robust under various ablation settings. We hope `eva` can serve as a scalable methodology for the research community to build open-ended, robust, and self-improving language agents that align with human values.
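
The creator/solver split described above can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not say how the creator scores prompts, so `reward_gap` (the spread of rewards across sampled responses) is a hypothetical stand-in for "informative"; `toy_solver` and `toy_reward` are placeholders for an LLM and a reward model; and the creator here only selects from an existing pool, omitting the generation of new prompts. The preference pairs it builds are the kind of input that an algorithm such as DPO or SimPO would consume, but no optimizer is implemented.

```python
import random


def reward_gap(rewards):
    """Hypothetical informativeness proxy: spread of rewards for one prompt."""
    return max(rewards) - min(rewards)


def creator_step(pool, solver, reward_fn, n_samples, n_keep, rng):
    """Creator: rank prompts by a reward-derived score and keep the top n_keep."""
    scored = []
    for prompt in pool:
        responses = [solver(prompt, rng) for _ in range(n_samples)]
        rewards = [reward_fn(prompt, r) for r in responses]
        scored.append((reward_gap(rewards), prompt))
    # Stable sort: prompts with equal scores keep their original pool order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [prompt for _, prompt in scored[:n_keep]]


def solver_step(prompts, solver, reward_fn, n_samples, rng):
    """Solver: build (prompt, chosen, rejected) pairs for a preference optimizer."""
    pairs = []
    for prompt in prompts:
        responses = [solver(prompt, rng) for _ in range(n_samples)]
        ranked = sorted(responses, key=lambda r: reward_fn(prompt, r))
        pairs.append((prompt, ranked[-1], ranked[0]))  # (prompt, best, worst)
    return pairs  # would be handed to DPO, SimPO, etc. (not implemented here)


def toy_solver(prompt, rng):
    """Assumed stand-in for sampling from an LLM: a random integer 'response'."""
    return rng.randint(0, len(prompt))


def toy_reward(prompt, response):
    """Assumed stand-in for a reward model: larger integers score higher."""
    return float(response)


if __name__ == "__main__":
    rng = random.Random(0)
    pool = ["hi", "explain DPO", "prove that sqrt(2) is irrational"]
    for round_idx in range(2):
        selected = creator_step(pool, toy_solver, toy_reward, 4, 2, rng)
        pairs = solver_step(selected, toy_solver, toy_reward, 4, rng)
        print(round_idx, selected, pairs)
```

Keeping the two steps as separate functions mirrors the asymmetry in the framework: the creator only needs reward signals to decide which prompts to emphasize, while the solver step is agnostic to the downstream preference-optimization algorithm.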

Cite

Text

Ye et al. "Evolving Alignment via Asymmetric Self-Play." NeurIPS 2024 Workshops: LanGame, 2024.

Markdown

[Ye et al. "Evolving Alignment via Asymmetric Self-Play." NeurIPS 2024 Workshops: LanGame, 2024.](https://mlanthology.org/neuripsw/2024/ye2024neuripsw-evolving/)

BibTeX

@inproceedings{ye2024neuripsw-evolving,
  title     = {{Evolving Alignment via Asymmetric Self-Play}},
  author    = {Ye, Ziyu and Agarwal, Rishabh and Liu, Tianqi and Joshi, Rishabh and Velury, Sarmishta and Le, Quoc V and Tan, Qijun and Liu, Yuan},
  booktitle = {NeurIPS 2024 Workshops: LanGame},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/ye2024neuripsw-evolving/}
}