Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning
Abstract
We consider the problem of off-policy evaluation for reinforcement learning, where the goal is to estimate the expected reward of a target policy $\pi$ using offline data collected by running a logging policy $\mu$. Standard importance-sampling-based approaches to this problem suffer from a variance that scales exponentially with the time horizon $H$, which motivates a surge of recent interest in alternatives that break the "Curse of Horizon" (Liu et al. 2018, Xie et al. 2019). In particular, it was shown that a marginalized importance sampling (MIS) approach can be used to achieve an estimation error of order $O(H^3/n)$ in mean square error (MSE) under an episodic Markov Decision Process model with finite states and potentially infinite actions. This MSE bound, however, is still a factor of $H$ away from a Cramér-Rao lower bound of order $\Omega(H^2/n)$. In this paper, we prove that with a simple modification to the MIS estimator, we can asymptotically attain the Cramér-Rao lower bound, provided that the action space is finite. We also provide a general method for constructing MIS estimators with high-probability error bounds.
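To make the marginalized importance sampling idea concrete, here is a minimal illustrative sketch (Python/NumPy) of a plug-in, tabular MIS-style estimate: the marginal state distribution of the target policy $\pi$ is rolled forward under an empirically estimated transition model and combined with empirically estimated mean rewards. This is only a sketch under simplifying assumptions (finite states and actions, fixed horizon $H$; names such as `mis_estimate` are illustrative), not the authors' exact estimator or its variance analysis.

```python
import numpy as np

def mis_estimate(episodes, pi, S, A, H):
    """Estimate the H-step value of target policy ``pi`` from logged episodes.

    episodes : list of length-H trajectories [(s_0, a_0, r_0), ..., (s_{H-1}, a_{H-1}, r_{H-1})]
    pi       : array of shape (H, S, A) with pi[h, s, a] = pi_h(a | s)
    """
    n = len(episodes)
    count_sa   = np.zeros((H, S, A))       # visit counts of (s, a) at step h
    count_sas  = np.zeros((H, S, A, S))    # transition counts (s, a) -> s'
    reward_sum = np.zeros((H, S, A))       # accumulated rewards at (h, s, a)
    d0 = np.zeros(S)                       # empirical initial state distribution
    for traj in episodes:
        d0[traj[0][0]] += 1.0 / n
        for h, (s, a, r) in enumerate(traj):
            count_sa[h, s, a] += 1
            reward_sum[h, s, a] += r
            if h + 1 < H:
                count_sas[h, s, a, traj[h + 1][0]] += 1
    # Plug-in model: P_hat(s'|s,a) and r_hat(s,a); unvisited (s,a) pairs default to zero.
    with np.errstate(divide="ignore", invalid="ignore"):
        P_hat = np.where(count_sa[..., None] > 0, count_sas / count_sa[..., None], 0.0)
        r_hat = np.where(count_sa > 0, reward_sum / count_sa, 0.0)
    # Roll the marginal state distribution of pi forward and accumulate expected reward.
    d_h, value = d0, 0.0
    for h in range(H):
        value += np.einsum("s,sa,sa->", d_h, pi[h], r_hat[h])   # sum_{s,a} d_h(s) pi_h(a|s) r_hat_h(s,a)
        d_h = np.einsum("s,sa,sat->t", d_h, pi[h], P_hat[h])    # d_{h+1}(s') = sum_{s,a} d_h(s) pi_h(a|s) P_hat_h(s'|s,a)
    return value
```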
Cite
Text
Yin and Wang. "Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning." Artificial Intelligence and Statistics, 2020.
Markdown
[Yin and Wang. "Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning." Artificial Intelligence and Statistics, 2020.](https://mlanthology.org/aistats/2020/yin2020aistats-asymptotically/)
BibTeX
@inproceedings{yin2020aistats-asymptotically,
title = {{Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning}},
author = {Yin, Ming and Wang, Yu-Xiang},
booktitle = {Artificial Intelligence and Statistics},
year = {2020},
pages = {3948-3958},
volume = {108},
url = {https://mlanthology.org/aistats/2020/yin2020aistats-asymptotically/}
}