Towards Instance-Optimal Offline Reinforcement Learning with Pessimism
Abstract
We study the \emph{offline reinforcement learning} (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown \emph{Markov Decision Process} (MDP) using data collected by a policy $\mu$. In particular, we consider the sample complexity of offline RL for finite-horizon MDPs. Prior works derive information-theoretical lower bounds based on different data-coverage assumptions, and their upper bounds are expressed in terms of covering coefficients, which lack an explicit characterization in terms of system quantities. In this work, we analyze the \emph{Adaptive Pessimistic Value Iteration} (APVI) algorithm and derive a suboptimality upper bound that nearly matches \[O\left(\sum_{h=1}^H\sum_{s_h,a_h}d^{\pi^\star}_h(s_h,a_h)\sqrt{\frac{\mathrm{Var}_{P_{s_h,a_h}}(V^\star_{h+1}+r_h)}{d^\mu_h(s_h,a_h)}}\sqrt{\frac{1}{n}}\right).\] We also prove an information-theoretical lower bound showing that this quantity is required under the weak assumption that $d^\mu_h(s_h,a_h)>0$ if $d^{\pi^\star}_h(s_h,a_h)>0$. Here $\pi^\star$ is an optimal policy, $\mu$ is the behavior policy, and $d(s_h,a_h)$ is the marginal state-action probability. We call this adaptive bound the \emph{intrinsic offline reinforcement learning bound} since it directly implies all the existing optimal results: the minimax rate under the uniform data-coverage assumption, the horizon-free setting, single policy concentrability, and the tight problem-dependent results. We then extend the result to the \emph{assumption-free} regime (where we make no assumption on $\mu$) and obtain the assumption-free intrinsic bound. Due to its generic form, we believe the intrinsic bound could help illuminate what makes a specific problem hard and reveal the fundamental challenges in offline RL.
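To make the quantity inside the $O(\cdot)$ concrete, the following is a minimal sketch that evaluates it on a small tabular finite-horizon MDP. It is not the APVI algorithm and is not taken from the paper: it assumes the model is fully known, that rewards are deterministic given $(s_h,a_h)$ (so only $V^\star_{h+1}$ contributes variance), and it ignores constants and logarithmic factors. The function name `intrinsic_bound`, the nested-list layout `P[h][s][a][s_next]`, and the toy MDP are all hypothetical choices made for illustration.

```python
import math


def intrinsic_bound(P, r, d0, mu, n):
    """Evaluate sum_h sum_{s,a} d*_h(s,a) * sqrt(Var / d^mu_h(s,a)) * sqrt(1/n).

    P[h][s][a] is a list of next-state probabilities, r[h][s][a] a deterministic
    reward (assumption), d0 the initial state distribution, and mu[h][s] the
    behavior policy's action probabilities. All names are illustrative.
    """
    H, S, A = len(P), len(d0), len(P[0][0])

    # Backward induction for V* and a greedy optimal policy (ties -> lowest index).
    V = [[0.0] * S for _ in range(H + 1)]
    greedy = [[0] * S for _ in range(H)]
    for h in reversed(range(H)):
        for s in range(S):
            q = [r[h][s][a] + sum(P[h][s][a][t] * V[h + 1][t] for t in range(S))
                 for a in range(A)]
            greedy[h][s] = q.index(max(q))
            V[h][s] = max(q)

    def occupancy(policy):
        # Forward recursion for the marginal state-action probabilities d_h(s, a).
        d, state = [], list(d0)
        for h in range(H):
            d_h = [[state[s] * policy[h][s][a] for a in range(A)] for s in range(S)]
            d.append(d_h)
            state = [sum(d_h[s][a] * P[h][s][a][t]
                         for s in range(S) for a in range(A)) for t in range(S)]
        return d

    pi_star = [[[1.0 if a == greedy[h][s] else 0.0 for a in range(A)]
                for s in range(S)] for h in range(H)]
    d_star, d_mu = occupancy(pi_star), occupancy(mu)

    total = 0.0
    for h in range(H):
        for s in range(S):
            for a in range(A):
                if d_star[h][s][a] == 0.0:
                    continue  # pairs the optimal policy never visits contribute nothing
                if d_mu[h][s][a] == 0.0:
                    raise ValueError("coverage assumption violated at (h, s, a)")
                mean = sum(P[h][s][a][t] * V[h + 1][t] for t in range(S))
                var = sum(P[h][s][a][t] * (V[h + 1][t] - mean) ** 2 for t in range(S))
                total += d_star[h][s][a] * math.sqrt(var / d_mu[h][s][a])
    return total / math.sqrt(n)


# Toy MDP (hypothetical): H = 2, two states, two actions, start in state 0.
P = [
    [[[0.5, 0.5], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]],  # h = 0
    [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]],  # h = 1
]
r = [
    [[0.0, 0.0], [0.0, 0.0]],  # h = 0: no reward
    [[0.0, 0.0], [1.0, 1.0]],  # h = 1: reward 1 in state 1
]
d0 = [1.0, 0.0]
mu = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(2)]  # uniform behavior policy

print(intrinsic_bound(P, r, d0, mu, n=100))
```

In this toy instance only the first step has a random transition under the optimal policy, so a single $(h, s_h, a_h)$ term is nonzero, and the value shrinks at the $\sqrt{1/n}$ rate as the number of episodes grows. The `ValueError` branch corresponds to the coverage assumption $d^\mu_h(s_h,a_h)>0$ whenever $d^{\pi^\star}_h(s_h,a_h)>0$.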
Cite
Text
Yin and Wang. "Towards Instance-Optimal Offline Reinforcement Learning with Pessimism." Neural Information Processing Systems, 2021.
Markdown
[Yin and Wang. "Towards Instance-Optimal Offline Reinforcement Learning with Pessimism." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/yin2021neurips-instanceoptimal/)
BibTeX
@inproceedings{yin2021neurips-instanceoptimal,
title = {{Towards Instance-Optimal Offline Reinforcement Learning with Pessimism}},
author = {Yin, Ming and Wang, Yu-Xiang},
booktitle = {Neural Information Processing Systems},
year = {2021},
url = {https://mlanthology.org/neurips/2021/yin2021neurips-instanceoptimal/}
}