Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling
Abstract
Motivated by the many real-world applications of reinforcement learning (RL) that require safe policy iterations, we consider the problem of off-policy evaluation (OPE) --- the problem of evaluating a new policy using the historical data obtained by different behavior policies --- under the model of nonstationary episodic Markov Decision Processes (MDP) with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from variance that depends exponentially on the RL horizon $H$. To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step. MIS achieves a mean-squared error of $\frac{1}{n} \sum_{t=1}^H\mathbb{E}_{\mu}\left[\frac{d_t^\pi(s_t)^2}{d_t^\mu(s_t)^2} \mathrm{Var}_{\mu}\left[\frac{\pi_t(a_t|s_t)}{\mu_t(a_t|s_t)}\big( V_{t+1}^\pi(s_{t+1}) + r_t\big) \middle| s_t\right]\right] + \tilde{O}(n^{-1.5})$, where $\mu$ and $\pi$ are the logging and target policies, $d_t^{\mu}(s_t)$ and $d_t^{\pi}(s_t)$ are the marginal distributions of the state at the $t$-th step under $\mu$ and $\pi$ respectively, $H$ is the horizon, $n$ is the sample size, and $V_{t+1}^\pi$ is the value function of the MDP under $\pi$. The result matches the Cramér-Rao lower bound in [Jiang and Li, 2016] up to a multiplicative factor of $H$. To the best of our knowledge, this is the first OPE estimation error bound with a polynomial dependence on $H$. Beyond the theory, we demonstrate the empirical superiority of our method in time-varying, partially observable, and long-horizon RL environments.
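For intuition, the sketch below shows one way a recursive marginalized importance sampling estimator of this kind could be implemented for a small tabular MDP. The function and variable names (`mis_estimate`, `trajectories`, `pi`, `mu`) are illustrative assumptions, and the simplified handling of rewards and unvisited states is only meant to convey the recursion over $\hat{d}_t^\pi$; it is not the paper's exact estimator.

```python
import numpy as np

def mis_estimate(trajectories, pi, mu, n_states, horizon):
    """Sketch of a marginalized IS estimate of the target policy's value.

    trajectories: list of episodes, each a list of (s_t, a_t, r_t) tuples.
    pi, mu: target / logging policies with pi[t][s][a] = prob. of action a at (t, s).
    """
    n = len(trajectories)
    v_hat = 0.0

    # Initial state marginal is shared by pi and mu; use its empirical estimate.
    d_pi = np.zeros(n_states)
    for traj in trajectories:
        d_pi[traj[0][0]] += 1.0 / n

    for t in range(horizon):
        d_mu = np.zeros(n_states)               # empirical state marginal under mu at step t
        r_num = np.zeros(n_states)              # importance-weighted reward sums
        p_num = np.zeros((n_states, n_states))  # importance-weighted transition counts
        for traj in trajectories:
            s, a, r = traj[t]
            w = pi[t][s][a] / mu[t][s][a]       # per-step action importance weight
            d_mu[s] += 1.0 / n
            r_num[s] += w * r / n
            if t + 1 < horizon:
                s_next = traj[t + 1][0]
                p_num[s, s_next] += w / n
        visited = d_mu > 0

        # Estimated expected reward under pi at each visited state, weighted by
        # the current estimate of the state marginal under pi.
        r_pi = np.where(visited, r_num / np.maximum(d_mu, 1e-12), 0.0)
        v_hat += float(np.dot(d_pi, r_pi))

        # Recursive marginal update: d_{t+1}^pi(s') = sum_s d_t^pi(s) * P_hat_t^pi(s'|s).
        p_pi = np.where(visited[:, None], p_num / np.maximum(d_mu[:, None], 1e-12), 0.0)
        d_pi = d_pi @ p_pi

    return v_hat
```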
Cite
Text
Xie et al. "Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling." Neural Information Processing Systems, 2019.
Markdown
[Xie et al. "Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling." Neural Information Processing Systems, 2019.](https://mlanthology.org/neurips/2019/xie2019neurips-optimal/)
BibTeX
@inproceedings{xie2019neurips-optimal,
title = {{Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling}},
author = {Xie, Tengyang and Ma, Yifei and Wang, Yu-Xiang},
booktitle = {Neural Information Processing Systems},
year = {2019},
pages = {9668--9678},
url = {https://mlanthology.org/neurips/2019/xie2019neurips-optimal/}
}