On the Design of Estimators for Bandit Off-Policy Evaluation

Abstract

Off-policy evaluation is the problem of estimating the value of a target policy using data collected under a different policy. Given a base estimator for bandit off-policy evaluation and a parametrized class of control variates, we address the problem of computing a control variate in that class that reduces the risk of the base estimator. We derive the population risk as a function of the class parameters and establish conditions that guarantee risk improvement. We present our main results in the context of multi-armed bandits, and we propose a simple design for contextual bandits that gives rise to an estimator that is shown to perform well on multi-class cost-sensitive classification datasets.
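As a minimal numerical sketch of the control-variate idea (not the paper's actual construction), consider importance-sampling (IPS) off-policy evaluation in a toy multi-armed bandit. The importance weight minus one, w - 1, has mean zero under the behavior policy, so subtracting theta * (w - 1) leaves the estimator unbiased for any coefficient theta; choosing theta to minimize the empirical variance can only reduce the per-sample variance. All policies and reward means below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-armed bandit: behavior policy mu and target policy pi (both assumed known).
mu = np.array([0.5, 0.3, 0.2])
pi = np.array([0.2, 0.3, 0.5])
r_mean = np.array([0.1, 0.5, 0.9])  # Bernoulli reward mean per arm

n = 20000
a = rng.choice(3, size=n, p=mu)               # actions logged under mu
r = rng.binomial(1, r_mean[a]).astype(float)  # observed rewards

w = pi[a] / mu[a]    # importance weights
base_terms = w * r   # per-sample terms of the base IPS estimator

# Control variate c = w - 1 has mean zero under mu, so subtracting
# theta * c keeps the estimator unbiased for every theta.
c = w - 1.0
# Variance-minimizing coefficient (empirical covariance / variance, same ddof).
theta = np.cov(base_terms, c, ddof=0)[0, 1] / c.var()
cv_terms = base_terms - theta * c

v_true = pi @ r_mean  # ground-truth target-policy value (known in this toy setup)
print(f"true value   {v_true:.3f}")
print(f"IPS          {base_terms.mean():.3f}  (per-sample var {base_terms.var():.3f})")
print(f"IPS + CV     {cv_terms.mean():.3f}  (per-sample var {cv_terms.var():.3f})")
```

Because theta is the empirical variance minimizer, the control-variate terms have per-sample variance no larger than the base IPS terms on the same data; the paper's contribution is to characterize this risk reduction for general parametrized classes of control variates.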

Cite

Text

Vlassis et al. "On the Design of Estimators for Bandit Off-Policy Evaluation." International Conference on Machine Learning, 2019.

Markdown

[Vlassis et al. "On the Design of Estimators for Bandit Off-Policy Evaluation." International Conference on Machine Learning, 2019.](https://mlanthology.org/icml/2019/vlassis2019icml-design/)

BibTeX

@inproceedings{vlassis2019icml-design,
  title     = {{On the Design of Estimators for Bandit Off-Policy Evaluation}},
  author    = {Vlassis, Nikos and Bibaut, Aurelien and Dimakopoulou, Maria and Jebara, Tony},
  booktitle = {International Conference on Machine Learning},
  year      = {2019},
  pages     = {6468--6476},
  volume    = {97},
  url       = {https://mlanthology.org/icml/2019/vlassis2019icml-design/}
}