Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning

Abstract

We consider the problem of off-policy evaluation for reinforcement learning, where the goal is to estimate the expected reward of a target policy $\pi$ using offline data collected by running a logging policy $\mu$. Standard importance-sampling based approaches to this problem suffer from a variance that scales exponentially with the time horizon $H$, which motivates a surge of recent interest in alternatives that break the "Curse of Horizon" (Liu et al. 2018, Xie et al. 2019). In particular, it was shown that a marginalized importance sampling (MIS) approach can be used to achieve an estimation error of order $O(H^3/n)$ in mean square error (MSE) under an episodic Markov Decision Process model with finite states and potentially infinite actions. This MSE bound, however, is still a factor of $H$ away from a Cramér-Rao lower bound of order $\Omega(H^2/n)$. In this paper, we prove that with a simple modification to the MIS estimator, we can asymptotically attain the Cramér-Rao lower bound, provided that the action space is finite. We also provide a general method for constructing MIS estimators with high-probability error bounds.
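
The sketch below illustrates the marginalized importance sampling idea referenced in the abstract: instead of reweighting whole trajectories (whose weights compound over $H$ steps), one estimates the per-timestep marginal state distribution of the target policy from the logged data and reweights rewards state by state. This is a minimal illustration only, not the authors' exact estimator; the function name `mis_estimate`, the trajectory format, and the simple zeroing of unvisited state-action pairs are assumptions made for the example.

```python
# Minimal illustrative sketch of tabular marginalized importance sampling (MIS)
# for off-policy evaluation. Not the paper's exact estimator; names, shapes,
# and the handling of unvisited state-action pairs are illustrative assumptions.
import numpy as np

def mis_estimate(episodes, pi, H, S, A):
    """Estimate the H-step value of target policy `pi` from logged episodes.

    episodes : list of trajectories, each a length-H list of (s, a, r) triples
    pi       : array of shape (H, S, A); pi[t, s, a] = target action probability
    """
    n = len(episodes)
    trans = np.zeros((H, S, A, S))   # counts of (s_t, a_t, s_{t+1})
    visits = np.zeros((H, S, A))     # counts of (s_t, a_t)
    rew_sum = np.zeros((H, S, A))    # summed rewards at (s_t, a_t)
    d1 = np.zeros(S)                 # empirical initial state distribution

    for traj in episodes:
        d1[traj[0][0]] += 1.0 / n
        for t, (s, a, r) in enumerate(traj):
            visits[t, s, a] += 1
            rew_sum[t, s, a] += r
            if t + 1 < H:
                trans[t, s, a, traj[t + 1][0]] += 1

    # Plug-in model: empirical transitions and mean rewards (0 where unvisited).
    safe = np.maximum(visits, 1)
    P_hat = trans / safe[..., None]
    r_hat = rew_sum / safe

    # Roll the estimated state marginals of pi forward and accumulate reward.
    d_t = d1.copy()
    value = 0.0
    for t in range(H):
        # Expected reward at time t under pi's action probabilities.
        value += np.sum(d_t[:, None] * pi[t] * r_hat[t])
        if t + 1 < H:
            # d_{t+1}(s') = sum_{s,a} d_t(s) * pi(a|s) * P_hat(s'|s,a)
            d_t = np.einsum('s,sa,sax->x', d_t, pi[t], P_hat[t])
    return value
```

Because the estimator only tracks marginal state distributions rather than products of per-step importance ratios, its variance need not grow exponentially in $H$, which is the "Curse of Horizon" the abstract refers to.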

Cite

Text

Yin and Wang. "Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning." Artificial Intelligence and Statistics, 2020.

Markdown

[Yin and Wang. "Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning." Artificial Intelligence and Statistics, 2020.](https://mlanthology.org/aistats/2020/yin2020aistats-asymptotically/)

BibTeX

@inproceedings{yin2020aistats-asymptotically,
  title     = {{Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning}},
  author    = {Yin, Ming and Wang, Yu-Xiang},
  booktitle = {Artificial Intelligence and Statistics},
  year      = {2020},
  pages     = {3948--3958},
  volume    = {108},
  url       = {https://mlanthology.org/aistats/2020/yin2020aistats-asymptotically/}
}