Bayesian Counterfactual Risk Minimization

Abstract

We present a Bayesian view of counterfactual risk minimization (CRM) for offline learning from logged bandit feedback. Using PAC-Bayesian analysis, we derive a new generalization bound for the truncated inverse propensity score estimator. We apply the bound to a class of Bayesian policies, which motivates a novel, potentially data-dependent, regularization technique for CRM. Experimental results indicate that this technique outperforms standard $L_2$ regularization, and that it is competitive with variance regularization while being both simpler to implement and more computationally efficient.
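The truncated inverse propensity score estimator mentioned in the abstract can be sketched as follows. This is an illustrative implementation of the standard truncated IPS estimator, not code from the paper; the function name, interface, and the truncation level `m` are assumptions for the example.

```python
import numpy as np

def truncated_ips_estimate(rewards, logging_probs, target_probs, m=10.0):
    """Truncated IPS estimate of a target policy's expected reward
    from logged bandit feedback.

    rewards       : observed rewards for the logged actions
    logging_probs : probabilities mu(a|x) under the logging policy
    target_probs  : probabilities pi(a|x) under the policy being evaluated
    m             : truncation level for the importance weights (illustrative)
    """
    # Importance weights are capped at m, trading variance for bias.
    weights = np.minimum(target_probs / logging_probs, m)
    return float(np.mean(weights * rewards))
```

For example, with two logged events where the logging policy chose each action with probability 0.5 and the target policy assigns probabilities 1.0 and 0.25, `truncated_ips_estimate(np.array([1.0, 1.0]), np.array([0.5, 0.5]), np.array([1.0, 0.25]))` returns 1.25, the average of the (untruncated here) weights 2.0 and 0.5.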

Cite

Text

London and Sandler. "Bayesian Counterfactual Risk Minimization." International Conference on Machine Learning, 2019.

Markdown

[London and Sandler. "Bayesian Counterfactual Risk Minimization." International Conference on Machine Learning, 2019.](https://mlanthology.org/icml/2019/london2019icml-bayesian/)

BibTeX

@inproceedings{london2019icml-bayesian,
  title     = {{Bayesian Counterfactual Risk Minimization}},
  author    = {London, Ben and Sandler, Ted},
  booktitle = {International Conference on Machine Learning},
  year      = {2019},
  pages     = {4125--4133},
  volume    = {97},
  url       = {https://mlanthology.org/icml/2019/london2019icml-bayesian/}
}