Two-Step Offline Preference-Based Reinforcement Learning on Explicitly Constrained Policies

Abstract

Preference-based reinforcement learning (PBRL) in the offline setting has achieved great success in industrial applications such as chatbots. A widely adopted two-step learning framework first learns a reward model from an offline dataset and then optimizes a policy over the learned reward model through online reinforcement learning. However, such a method faces two challenges: the risk of reward hacking and the complexity of reinforcement learning. Our key insight is that both challenges arise from state-actions not supported in the dataset: such state-actions yield unreliable reward estimates and increase the complexity of the reinforcement learning problem. Based on this insight, we develop a novel two-step learning method called PRC: preference-based reinforcement learning on explicitly constrained policies. The high-level idea is to restrict the reinforcement learning agent to optimizing over policies supported on an explicitly constrained action space that excludes the out-of-distribution state-actions. We empirically verify that our method achieves high learning efficiency on various datasets in robotic control environments.
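The sketch below illustrates the two-step idea described in the abstract, not the authors' implementation: (1) a reward model is fit from pairwise segment preferences with a Bradley-Terry loss, and (2) the agent selects actions only from candidates that a behavior model (fit to the offline data) deems in-distribution, approximating an explicitly constrained action space. All module names, network sizes, and the log-probability threshold are illustrative assumptions.

```python
# Minimal sketch of two-step offline PBRL with an explicit action-support
# constraint. Shapes, names, and thresholds are assumptions for illustration.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, N_CANDIDATES = 17, 6, 32  # assumed dimensions


class RewardModel(nn.Module):
    """Maps a (state, action) pair to a scalar reward estimate."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, states, actions):
        return self.net(torch.cat([states, actions], dim=-1)).squeeze(-1)


def preference_loss(reward_model, seg_a, seg_b, prefer_a):
    """Step 1: Bradley-Terry loss on two trajectory segments.

    seg_*: dict with 'states' (T, STATE_DIM) and 'actions' (T, ACTION_DIM);
    prefer_a: True if segment A is the preferred one.
    """
    return_a = reward_model(seg_a["states"], seg_a["actions"]).sum()
    return_b = reward_model(seg_b["states"], seg_b["actions"]).sum()
    logits = torch.stack([return_a, return_b]).unsqueeze(0)   # (1, 2)
    target = torch.tensor([0 if prefer_a else 1])             # preferred index
    return nn.functional.cross_entropy(logits, target)


def behavior_policy(state):
    """Placeholder for a behavior model fit to the offline dataset.

    Here it is a fixed Gaussian only so the sketch runs end to end.
    """
    return torch.distributions.Normal(
        torch.zeros(ACTION_DIM), torch.ones(ACTION_DIM)
    )


def constrained_greedy_action(state, reward_model, threshold=-10.0):
    """Step 2: pick the highest-reward action among supported candidates.

    Candidates are sampled from the behavior model; those whose total
    log-probability falls below `threshold` are treated as
    out-of-distribution and excluded from the optimization.
    """
    dist = behavior_policy(state)
    candidates = dist.sample((N_CANDIDATES,))        # (N, ACTION_DIM)
    support = dist.log_prob(candidates).sum(-1)      # support score per candidate
    keep = support > threshold                       # explicit support constraint
    if keep.any():
        candidates = candidates[keep]
    rewards = reward_model(state.expand(len(candidates), -1), candidates)
    return candidates[rewards.argmax()]
```

In this sketch the constraint is enforced at action-selection time by filtering candidates through the behavior model's likelihood; other ways to realize an explicitly constrained policy class (e.g., restricting to dataset actions) would fit the same two-step template.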

Cite

Text

Xu et al. "Two-Step Offline Preference-Based Reinforcement Learning on Explicitly Constrained Policies." Transactions on Machine Learning Research, 2025.

Markdown

[Xu et al. "Two-Step Offline Preference-Based Reinforcement Learning on Explicitly Constrained Policies." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/xu2025tmlr-twostep/)

BibTeX

@article{xu2025tmlr-twostep,
  title     = {{Two-Step Offline Preference-Based Reinforcement Learning on Explicitly Constrained Policies}},
  author    = {Xu, Yinglun and Suresh, Tarun and Gumaste, Rohan and Zhu, David and Li, Ruirui and Wang, Zhengyang and Jiang, Haoming and Tang, Xianfeng and Yin, Qingyu and Cheng, Monica Xiao and Zeng, Qi and Zhang, Chao and Singh, Gagandeep},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/xu2025tmlr-twostep/}
}