Multi-Step Preference Optimization via Two-Player Markov Games

Abstract

Reinforcement Learning from Human Feedback (RLHF) has been highly successful in aligning large language models with human preferences. While prevalent methods like DPO have demonstrated strong performance, they frame interactions with the language model as a bandit problem, which limits their applicability in real-world scenarios where multi-turn conversations are common. Additionally, DPO relies on the Bradley-Terry model assumption, which does not adequately capture the non-transitive nature of human preferences. In this paper, we address these challenges by modeling the alignment problem as a two-player constant-sum Markov game, where each player seeks to maximize their winning rate against the other across all steps of the conversation. Our approach, Multi-step Preference Optimization (MPO), is built upon the natural actor-critic framework. We further develop OMPO based on the optimistic online gradient descent algorithm. Theoretically, we provide a rigorous convergence analysis for both algorithms and show that OMPO requires $\mathcal{O}(\epsilon^{-1})$ policy updates to converge to an $\epsilon$-approximate Nash equilibrium. We also validate the effectiveness of our method through experiments on the multi-turn conversation dataset MT-bench-101.
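The optimistic online gradient descent idea behind OMPO can be illustrated outside the paper's language-model setting on a toy constant-sum matrix game. The sketch below is an illustrative assumption, not the paper's algorithm: it runs optimistic gradient ascent-descent on rock-paper-scissors, where the unique Nash equilibrium is the uniform mixed strategy, and the payoff matrix, step size, and iteration count are all chosen for demonstration only.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto the probability simplex.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

# Rock-paper-scissors payoff for the row (maximizing) player.
A = np.array([[0.0, -1.0, 1.0],
              [1.0, 0.0, -1.0],
              [-1.0, 1.0, 0.0]])

x = np.ones(3) / 3          # row player's mixed strategy
y = np.ones(3) / 3          # column player's mixed strategy
gx_prev, gy_prev = A @ y, A.T @ x
eta = 0.1                   # step size (illustrative choice)

for _ in range(500):
    gx = A @ y              # row player's payoff gradient
    gy = A.T @ x            # column player's payoff gradient
    # Optimistic step: extrapolate with 2*g_t - g_{t-1} before projecting.
    x = project_simplex(x + eta * (2 * gx - gx_prev))
    y = project_simplex(y - eta * (2 * gy - gy_prev))
    gx_prev, gy_prev = gx, gy

# Duality gap: best response value for each side against the current iterate.
gap = (A @ y).max() - (A.T @ x).min()
```

Plain gradient ascent-descent cycles on this game, while the optimistic extrapolation term damps the rotation and drives the last iterate toward the equilibrium, which is the property the paper's $\mathcal{O}(\epsilon^{-1})$ rate for OMPO formalizes in the Markov-game setting.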

Cite

Text

Wu et al. "Multi-Step Preference Optimization via Two-Player Markov Games." NeurIPS 2024 Workshops: LanGame, 2024.

Markdown

[Wu et al. "Multi-Step Preference Optimization via Two-Player Markov Games." NeurIPS 2024 Workshops: LanGame, 2024.](https://mlanthology.org/neuripsw/2024/wu2024neuripsw-multistep/)

BibTeX

@inproceedings{wu2024neuripsw-multistep,
  title     = {{Multi-Step Preference Optimization via Two-Player Markov Games}},
  author    = {Wu, Yongtao and Viano, Luca and Chen, Yihang and Zhu, Zhenyu and Gu, Quanquan and Cevher, Volkan},
  booktitle = {NeurIPS 2024 Workshops: LanGame},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/wu2024neuripsw-multistep/}
}