Indirect Online Preference Optimization via Reinforcement Learning

Abstract

Human preference alignment (HPA) aims to ensure that Large Language Models (LLMs) respond appropriately to meet human moral and ethical requirements. Existing methods, such as RLHF and DPO, rely heavily on high-quality human annotation, which restricts the efficiency of iterative online model refinement. To address the inefficiency of acquiring human annotations, the iterated online strategy advocates using fine-tuned LLMs to self-generate preference data. However, this approach is prone to distribution bias, owing to differences between human and model annotations as well as modeling errors between simulators and real-world contexts. To mitigate the impact of distribution bias, we adopt the principles of adversarial training, framing a zero-sum two-player game between a protagonist agent and an adversarial agent. With the adversarial agent challenging the alignment of the protagonist agent, we continuously refine the protagonist’s performance. By utilizing min-max equilibrium and Nash equilibrium strategies, we propose the Indirect Online Preference Optimization (IOPO) mechanism, which enables the protagonist agent to converge without bias while maintaining linear computational complexity. Extensive experiments across three real-world datasets demonstrate that IOPO outperforms state-of-the-art alignment methods in both offline and online scenarios, as evidenced by standard alignment metrics and human evaluations. This innovation reduces the time required for model iterations from months to one week, alleviates distribution shifts, and significantly cuts annotation costs.
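
The sketch below is not the paper's IOPO algorithm; it is only a minimal, self-contained illustration of the min-max / Nash-equilibrium idea the abstract invokes: a protagonist and an adversary playing a zero-sum game, with no-regret (multiplicative-weights) self-play driving their average strategies toward equilibrium. The payoff matrix, player sizes, and learning rate are all hypothetical placeholders.

```python
import numpy as np

# Toy zero-sum two-player game: the protagonist (row player) maximizes the
# payoff, the adversary (column player) minimizes it. The random matrix is a
# stand-in for "how well the protagonist's responses survive the adversary's
# challenges" and is NOT derived from the paper.
rng = np.random.default_rng(0)
payoff = rng.uniform(-1.0, 1.0, size=(5, 5))


def hedge_update(weights, utilities, lr):
    """One multiplicative-weights (Hedge) step toward higher utility."""
    weights = weights * np.exp(lr * utilities)
    return weights / weights.sum()


protagonist = np.full(5, 1 / 5)   # mixed strategy of the protagonist agent
adversary = np.full(5, 1 / 5)     # mixed strategy of the adversarial agent
avg_p, avg_a = np.zeros(5), np.zeros(5)
lr, steps = 0.1, 2000

for _ in range(steps):
    # Each player soft-best-responds to the other's current mixed strategy.
    protagonist = hedge_update(protagonist, payoff @ adversary, lr)
    adversary = hedge_update(adversary, -(payoff.T @ protagonist), lr)
    avg_p += protagonist
    avg_a += adversary

# In zero-sum games, the average strategies of no-regret learners converge
# to an approximate Nash equilibrium; the product below approximates the
# game value at that equilibrium.
avg_p /= steps
avg_a /= steps
print("approx. game value:", avg_p @ payoff @ avg_a)
```

In the paper's setting the players are alignment policies rather than rows and columns of a fixed matrix, but the same principle applies: training the protagonist against an adversary that actively probes its weaknesses pushes the pair toward an equilibrium rather than toward the biases of any fixed self-generated preference distribution.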

Cite

Text

Wang et al. "Indirect Online Preference Optimization via Reinforcement Learning." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/61

Markdown

[Wang et al. "Indirect Online Preference Optimization via Reinforcement Learning." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/wang2025ijcai-indirect/) doi:10.24963/IJCAI.2025/61

BibTeX

@inproceedings{wang2025ijcai-indirect,
  title     = {{Indirect Online Preference Optimization via Reinforcement Learning}},
  author    = {Wang, En and Lin, Xingyu and Su, Du and Bao, Chenfu and Lv, Zhonghou and Yang, Funing and Xu, Yuanbo and Liu, Wenbin},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {538-546},
  doi       = {10.24963/IJCAI.2025/61},
  url       = {https://mlanthology.org/ijcai/2025/wang2025ijcai-indirect/}
}