On the Design of Estimators for Bandit Off-Policy Evaluation
Abstract
Off-policy evaluation is the problem of estimating the value of a target policy using data collected under a different policy. Given a base estimator for bandit off-policy evaluation and a parametrized class of control variates, we address the problem of computing a control variate in that class that reduces the risk of the base estimator. We derive the population risk as a function of the class parameters and we establish conditions that guarantee risk improvement. We present our main results in the context of multi-armed bandits, and we propose a simple design for contextual bandits that gives rise to an estimator that is shown to perform well on multi-class cost-sensitive classification datasets.
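To make the setup concrete, here is a minimal sketch (not the paper's estimator or its parametrized class) of the general idea: an inverse-propensity-scoring (IPS) base estimator for a multi-armed bandit, combined with the classical scalar control variate c(w - 1), where w is the importance weight. Since E[w - 1] = 0 under the logging policy, subtracting the control variate preserves unbiasedness while a well-chosen coefficient c reduces variance. The function name and the empirical plug-in choice of c are illustrative assumptions, not from the paper.

```python
import numpy as np

def ips_with_control_variate(rewards, logging_probs, target_probs, c=None):
    """Off-policy value estimate with a scalar control variate (illustrative).

    rewards:       observed rewards r_i from logged bandit data
    logging_probs: mu(a_i), probability of the logged action under the
                   logging policy
    target_probs:  pi(a_i), probability of the logged action under the
                   target policy
    c:             control-variate coefficient; if None, use the empirical
                   variance-minimizing plug-in Cov(w*r, w) / Var(w)
    """
    w = target_probs / logging_probs      # importance weights, E_mu[w] = 1
    wr = w * rewards                      # per-sample IPS terms
    if c is None:
        var_w = np.var(w, ddof=1)
        c = np.cov(wr, w)[0, 1] / var_w if var_w > 0 else 0.0
    # E_mu[w - 1] = 0, so the correction does not bias the estimate
    return np.mean(wr - c * (w - 1.0))
```

On synthetic logged data the control-variate estimate concentrates around the target policy's true value with lower variance than plain IPS whenever the importance weights correlate with the IPS terms; the paper's contribution is to characterize and optimize this risk reduction over a general parametrized class of control variates rather than a single scalar.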
Cite as:

Vlassis et al. "On the Design of Estimators for Bandit Off-Policy Evaluation." International Conference on Machine Learning, 2019. https://mlanthology.org/icml/2019/vlassis2019icml-design/

BibTeX:
@inproceedings{vlassis2019icml-design,
title = {{On the Design of Estimators for Bandit Off-Policy Evaluation}},
author = {Vlassis, Nikos and Bibaut, Aurelien and Dimakopoulou, Maria and Jebara, Tony},
booktitle = {International Conference on Machine Learning},
year = {2019},
pages = {6468--6476},
volume = {97},
url = {https://mlanthology.org/icml/2019/vlassis2019icml-design/}
}