Human-in-the-Loop: Provably Efficient Preference-Based Reinforcement Learning with General Function Approximation
Abstract
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences, where instead of receiving a numeric reward at each step, the RL agent only receives preferences over trajectory pairs from a human overseer. The goal of the RL agent is to learn the optimal policy that is most preferred by the human overseer. Despite its empirical success in various real-world applications, the theoretical understanding of preference-based RL (PbRL) has so far been limited to the tabular case. In this paper, we propose the first optimistic model-based algorithm for PbRL with general function approximation, which estimates the model using value-targeted regression and computes exploratory policies by solving an optimistic planning problem. We prove that our algorithm achieves a regret bound of $\tilde{O} (\operatorname{poly}(d H) \sqrt{K} )$, where $d$ is a complexity measure of the transition and preference models depending on the Eluder dimension and log-covering numbers, $H$ is the planning horizon, $K$ is the number of episodes, and $\tilde O(\cdot)$ omits logarithmic terms. Our lower bound indicates that the algorithm is near-optimal when specialized to the linear setting. Furthermore, we extend the PbRL problem by formulating a novel problem called RL with $n$-wise comparisons and provide the first sample-efficient algorithm for this new setting. To the best of our knowledge, this is the first theoretical result for PbRL with (general) function approximation.
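The preference feedback described in the abstract is over trajectory pairs; a common way to instantiate such a preference model is a Bradley-Terry/logistic link applied to trajectory utilities. The short sketch below assumes a linear trajectory-utility model, a hypothetical feature map, and synthetic preference data (none of which are taken from the paper) and only illustrates how pairwise preferences can be fit by maximum likelihood; it is not the paper's algorithm and omits the model estimation and optimistic planning components.

import numpy as np

# Illustrative sketch (not the paper's method): fit a linear trajectory-utility
# vector w from pairwise preferences under an ASSUMED Bradley-Terry/logistic link,
#   P(tau1 preferred over tau2) = sigmoid(<w, phi(tau1) - phi(tau2)>).
# The feature dimension d and the data-generating process below are purely synthetic.

rng = np.random.default_rng(0)
d = 4                          # assumed trajectory feature dimension
w_true = rng.normal(size=d)    # unknown "human overseer" utility parameters

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Synthetic preference data: features of trajectory pairs plus sampled labels.
n_pairs = 500
phi1 = rng.normal(size=(n_pairs, d))
phi2 = rng.normal(size=(n_pairs, d))
prob_prefer_1 = sigmoid((phi1 - phi2) @ w_true)
labels = rng.binomial(1, prob_prefer_1)   # 1 if trajectory 1 was preferred

# Maximum-likelihood estimate of w via gradient ascent on the logistic log-likelihood.
w_hat = np.zeros(d)
lr = 0.1
for _ in range(2000):
    diff = phi1 - phi2
    p = sigmoid(diff @ w_hat)
    w_hat += lr * (diff.T @ (labels - p)) / n_pairs

print("true w:     ", np.round(w_true, 2))
print("estimated w:", np.round(w_hat, 2))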
Cite
Text
Chen et al. "Human-in-the-Loop: Provably Efficient Preference-Based Reinforcement Learning with General Function Approximation." International Conference on Machine Learning, 2022.
Markdown
[Chen et al. "Human-in-the-Loop: Provably Efficient Preference-Based Reinforcement Learning with General Function Approximation." International Conference on Machine Learning, 2022.](https://mlanthology.org/icml/2022/chen2022icml-humanintheloop/)
BibTeX
@inproceedings{chen2022icml-humanintheloop,
title = {{Human-in-the-Loop: Provably Efficient Preference-Based Reinforcement Learning with General Function Approximation}},
author = {Chen, Xiaoyu and Zhong, Han and Yang, Zhuoran and Wang, Zhaoran and Wang, Liwei},
booktitle = {International Conference on Machine Learning},
year = {2022},
pages = {3773--3793},
volume = {162},
url = {https://mlanthology.org/icml/2022/chen2022icml-humanintheloop/}
}