Universal Off-Policy Evaluation
Abstract
When faced with sequential decision-making problems, it is often useful to be able to predict what would happen if decisions were made using a new policy. Those predictions must often be based on data collected under some previously used decision-making rule. Many previous methods enable such off-policy (or counterfactual) estimation of the expected value of a performance measure called the return. In this paper, we take the first steps towards a 'universal off-policy estimator' (UnO)---one that provides off-policy estimates and high-confidence bounds for any parameter of the return distribution. We use UnO for estimating and simultaneously bounding the mean, variance, quantiles/median, inter-quantile range, CVaR, and the entire cumulative distribution of returns. Finally, we also discuss UnO's applicability in various settings, including fully observable, partially observable (i.e., with unobserved confounders), Markovian, non-Markovian, stationary, smoothly non-stationary, and discrete distribution shifts.
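Concretely, UnO's starting point is an off-policy estimate of the entire CDF of returns, from which each of the parameters listed above can be read off. The sketch below, assuming full-trajectory importance ratios between the evaluation and behavior policies, illustrates that idea in Python; it is not the authors' released code, and the helper names and synthetic data are illustrative assumptions.

```python
# A minimal sketch of the importance-weighted CDF idea behind UnO, not the
# authors' implementation: function names and the synthetic data below are
# illustrative assumptions.
import numpy as np

def off_policy_cdf(returns, weights, nu_grid):
    """Estimate F(nu) = Pr(return <= nu) under the evaluation policy.

    returns : per-trajectory returns G_i observed under the behavior policy.
    weights : full-trajectory importance ratios
              rho_i = prod_t pi_e(a_t | s_t) / pi_b(a_t | s_t).
    nu_grid : sorted grid of return values at which to evaluate the CDF.
    """
    returns = np.asarray(returns, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # F_hat(nu) = (1/n) * sum_i rho_i * 1{G_i <= nu}; enforce monotonicity
    # and clip to [0, 1] so the estimate is a valid CDF despite weight noise.
    raw = np.array([(weights * (returns <= nu)).mean() for nu in nu_grid])
    return np.clip(np.maximum.accumulate(raw), 0.0, 1.0)

def quantile_from_cdf(cdf, nu_grid, alpha):
    """Smallest grid value nu with F_hat(nu) >= alpha (alpha=0.5 gives the median)."""
    idx = np.searchsorted(cdf, alpha, side="left")
    return nu_grid[min(idx, len(nu_grid) - 1)]

# Usage with synthetic data (hypothetical returns and importance ratios).
rng = np.random.default_rng(0)
G = rng.normal(loc=1.0, scale=0.5, size=1000)
rho = rng.lognormal(mean=0.0, sigma=0.3, size=1000)
grid = np.linspace(G.min(), G.max(), 200)
F = off_policy_cdf(G, rho, grid)
median_est = quantile_from_cdf(F, grid, 0.5)
# Mean from the CDF: E[G] = b - integral_a^b F(nu) dnu for returns in [a, b],
# here via a trapezoidal sum over the grid.
mean_est = grid[-1] - np.sum(np.diff(grid) * (F[:-1] + F[1:]) / 2.0)
```

Because every parameter named in the abstract (mean, variance, quantiles, inter-quantile range, CVaR) is a functional of the CDF, a single CDF estimate, together with the paper's high-confidence band around it, yields estimates and simultaneous bounds for all of them at once.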
Cite
Text
Chandak et al. "Universal Off-Policy Evaluation." Neural Information Processing Systems, 2021.
Markdown
[Chandak et al. "Universal Off-Policy Evaluation." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/chandak2021neurips-universal/)
BibTeX
@inproceedings{chandak2021neurips-universal,
  title     = {{Universal Off-Policy Evaluation}},
  author    = {Chandak, Yash and Niekum, Scott and da Silva, Bruno and Learned-Miller, Erik and Brunskill, Emma and Thomas, Philip S.},
  booktitle = {Neural Information Processing Systems},
  year      = {2021},
  url       = {https://mlanthology.org/neurips/2021/chandak2021neurips-universal/}
}