Online-to-Offline RL for Agent Alignment
Abstract
Reinforcement learning (RL) has shown remarkable success in training high-performing agent policies, particularly in domains like Game AI where simulation environments enable efficient interaction. However, despite their success in maximizing environment returns, such online-trained policies often fail to align with human preferences concerning actions, styles, and values. The challenge lies in efficiently adapting these online-trained policies to human preferences, given the scarcity and high cost of collecting human behavior data. In this work, we formalize the problem as *online-to-offline* RL and propose ALIGNment of Game AI to Preferences (ALIGN-GAP), an approach for aligning well-trained game agents with human preferences. Our method features a carefully designed reward model that encodes human preferences from limited offline data and incorporates curriculum-based preference learning to align RL agents with targeted human preferences. Experiments across diverse environments and preference types show that ALIGN-GAP achieves effective alignment with human preferences.
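The abstract refers to a reward model that encodes human preferences from limited offline data. As context only, a common way to learn such a model from pairwise preference labels is the Bradley-Terry formulation; the sketch below illustrates that generic recipe in PyTorch. All names, dimensions, and hyperparameters are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch of pairwise preference-based reward modeling (Bradley-Terry).
# Hypothetical setup: feature vectors for trajectory segments, not the paper's
# actual architecture or training pipeline.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a segment feature vector to a scalar reward estimate."""
    def __init__(self, input_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def bradley_terry_loss(reward_model: RewardModel,
                       preferred: torch.Tensor,
                       rejected: torch.Tensor) -> torch.Tensor:
    """Encourage the preferred segment to receive a higher reward."""
    r_pref = reward_model(preferred)  # shape: (batch,)
    r_rej = reward_model(rejected)    # shape: (batch,)
    return -torch.nn.functional.logsigmoid(r_pref - r_rej).mean()

# Toy usage on a batch of offline preference pairs.
model = RewardModel(input_dim=8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
preferred = torch.randn(32, 8)  # features of human-preferred segments
rejected = torch.randn(32, 8)   # features of rejected segments
loss = bradley_terry_loss(model, preferred, rejected)
loss.backward()
optimizer.step()
```

The learned reward can then replace or reshape the environment return when fine-tuning the online-trained policy, which is the general role a preference reward model plays in this kind of alignment setting.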
Cite
Text
Liu et al. "Online-to-Offline RL for Agent Alignment." International Conference on Learning Representations, 2025.
Markdown
[Liu et al. "Online-to-Offline RL for Agent Alignment." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/liu2025iclr-onlinetooffline/)
BibTeX
@inproceedings{liu2025iclr-onlinetooffline,
title = {{Online-to-Offline RL for Agent Alignment}},
author = {Liu, Xu and Fu, Haobo and Albrecht, Stefano V and Fu, Qiang and Li, Shuai},
booktitle = {International Conference on Learning Representations},
year = {2025},
url = {https://mlanthology.org/iclr/2025/liu2025iclr-onlinetooffline/}
}