In-Dataset Trajectory Return Regularization for Offline Preference-Based Reinforcement Learning

Tu, Songjun; Sun, Jingbo; Zhang, Qichao; Zhang, Yaocheng; Liu, Jia; Chen, Ke; Zhao, Dongbin

doi:10.1609/AAAI.V39I20.35388

AAAI 2025 pp. 20929-20937

Abstract

Offline preference-based reinforcement learning (PbRL) typically operates in two phases: first, use human preferences to learn a reward model and annotate rewards for a reward-free offline dataset; second, learn a policy by optimizing the learned reward via offline RL. However, accurately modeling step-wise rewards from trajectory-level preference feedback presents inherent challenges. The reward bias introduced, particularly the overestimation of predicted rewards, leads to optimistic trajectory stitching, which undermines the pessimism mechanism critical to the offline RL phase. To address this challenge, we propose In-Dataset Trajectory Return Regularization (DTR) for offline PbRL, which leverages conditional sequence modeling to mitigate the risk of learning inaccurate trajectory stitching under reward bias. Specifically, DTR employs Decision Transformer and TD-Learning to strike a balance between maintaining fidelity to the behavior policy with high in-dataset trajectory returns and selecting optimal actions based on high reward labels. Additionally, we introduce an ensemble normalization technique that effectively integrates multiple reward models, balancing the trade-off between reward differentiation and accuracy. Empirical evaluations on various benchmarks demonstrate the superiority of DTR over other state-of-the-art baselines.
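As a rough illustration of the first phase described in the abstract, the sketch below scores a pair of trajectory segments with the Bradley-Terry preference model that is commonly used in PbRL, then combines several reward models by normalizing each one before averaging. The abstract does not specify the loss or the normalization rule, so both are assumptions: `preference_prob`, `preference_loss`, and `ensemble_normalized_reward` are hypothetical names, and the min-max averaging is a stand-in for illustration, not necessarily the ensemble normalization DTR uses.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def preference_prob(rewards_a, rewards_b):
    """Bradley-Terry probability that segment A is preferred over segment B.

    Each argument is a list of predicted step-wise rewards for one segment;
    the segment score is the sum of its step-wise rewards.
    """
    return _sigmoid(sum(rewards_a) - sum(rewards_b))


def preference_loss(rewards_a, rewards_b, label):
    """Cross-entropy loss for one labeled pair.

    label is 1.0 if the annotator preferred A and 0.0 if they preferred B.
    """
    eps = 1e-12  # guards against log(0)
    p = preference_prob(rewards_a, rewards_b)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


def ensemble_normalized_reward(predictions):
    """Assumed combination rule: min-max normalize each model, then average.

    predictions[m][t] is model m's predicted reward for transition t.
    """
    normalized = []
    for model_preds in predictions:
        lo, hi = min(model_preds), max(model_preds)
        span = hi - lo
        # A constant model carries no ranking information; map it to zeros.
        normalized.append([(r - lo) / span if span > 0 else 0.0 for r in model_preds])
    n_models = len(normalized)
    return [sum(col) / n_models for col in zip(*normalized)]
```

Because only the summed reward of each segment enters the loss, many different step-wise reward assignments explain the same preference labels equally well, which is one way to see why step-wise rewards are hard to recover from trajectory-level feedback.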

Cite

Text

Tu et al. "In-Dataset Trajectory Return Regularization for Offline Preference-Based Reinforcement Learning." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I20.35388

Markdown

[Tu et al. "In-Dataset Trajectory Return Regularization for Offline Preference-Based Reinforcement Learning." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/tu2025aaai-dataset/) doi:10.1609/AAAI.V39I20.35388

BibTeX

@inproceedings{tu2025aaai-dataset,
  title     = {{In-Dataset Trajectory Return Regularization for Offline Preference-Based Reinforcement Learning}},
  author    = {Tu, Songjun and Sun, Jingbo and Zhang, Qichao and Zhang, Yaocheng and Liu, Jia and Chen, Ke and Zhao, Dongbin},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {20929--20937},
  doi       = {10.1609/AAAI.V39I20.35388},
  url       = {https://mlanthology.org/aaai/2025/tu2025aaai-dataset/}
}