Nearly Horizon-Free Offline Reinforcement Learning
Abstract
We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes (MDPs). For tabular MDPs with $S$ states and $A$ actions, or linear MDPs with anchor points and feature dimension $d$, given $K$ episodes of collected data with minimum visiting probability $d_m$ over (anchor) state-action pairs, we obtain nearly horizon-free (i.e., $H$-free) sample complexity bounds for offline reinforcement learning when the total reward is upper bounded by 1. Specifically:

- For offline policy evaluation, we obtain an $\tilde{O}\left(\sqrt{\frac{1}{Kd_m}}\right)$ error bound for the plug-in estimator, which matches the lower bound up to logarithmic factors and has no additional $\mathrm{poly}(H, S, A, d)$ dependency in the higher-order term.
- For offline policy optimization, we obtain an $\tilde{O}\left(\sqrt{\frac{1}{Kd_m}} + \frac{\min(S, d)}{Kd_m}\right)$ sub-optimality gap for the empirical optimal policy, which approaches the lower bound up to logarithmic factors and a higher-order term, improving upon the best known result of [Cui and Yang 2020], which has additional $\mathrm{poly}(H, S, d)$ factors in the main term.

To the best of our knowledge, these are the first nearly horizon-free bounds for episodic time-homogeneous offline tabular MDPs and linear MDPs with anchor points. Central to our analysis is a simple yet effective recursion-based method for bounding a "total variance" term in the offline setting, which may be of independent interest.
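To make the plug-in estimator mentioned above concrete, here is a minimal sketch of plug-in offline policy evaluation for a tabular, time-homogeneous MDP: build an empirical transition and reward model from logged transitions, then evaluate the target policy by backward induction over the horizon. This is only an illustrative reading of the setup, not the authors' implementation; the data format, function name `plug_in_ope`, and the convention of treating unvisited state-action pairs as zero-reward absorbing are assumptions.

```python
import numpy as np

def plug_in_ope(transitions, pi, S, A, H):
    """Plug-in offline policy evaluation for a tabular, time-homogeneous MDP.

    transitions: iterable of (s, a, r, s_next) tuples from the offline data.
    pi: array of shape (S, A); pi[s, a] is the probability the target
        policy takes action a in state s.
    Returns the estimated value function V_hat of shape (H + 1, S).
    """
    counts = np.zeros((S, A, S))
    reward_sum = np.zeros((S, A))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r

    visits = counts.sum(axis=2)                # N(s, a)
    safe = np.maximum(visits, 1)               # avoid division by zero
    P_hat = counts / safe[:, :, None]          # empirical P(s' | s, a)
    r_hat = reward_sum / safe                  # empirical r(s, a)
    # Unvisited (s, a) pairs keep P_hat = 0 and r_hat = 0 (an assumed convention).

    # Backward induction with the same (time-homogeneous) model at every step.
    V_hat = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        Q_h = r_hat + P_hat @ V_hat[h + 1]     # shape (S, A)
        V_hat[h] = (pi * Q_h).sum(axis=1)
    return V_hat
```

The estimated value of the policy is then `V_hat[0]` averaged under the initial state distribution; the paper's analysis concerns how fast this plug-in estimate converges as the number of episodes $K$ grows.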
Cite
Text
Ren et al. "Nearly Horizon-Free Offline Reinforcement Learning." Neural Information Processing Systems, 2021.
Markdown
[Ren et al. "Nearly Horizon-Free Offline Reinforcement Learning." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/ren2021neurips-nearly/)
BibTeX
@inproceedings{ren2021neurips-nearly,
title = {{Nearly Horizon-Free Offline Reinforcement Learning}},
author = {Ren, Tongzheng and Li, Jialian and Dai, Bo and Du, Simon S and Sanghavi, Sujay},
booktitle = {Neural Information Processing Systems},
year = {2021},
url = {https://mlanthology.org/neurips/2021/ren2021neurips-nearly/}
}