Counterfactual Data-Fusion for Online Reinforcement Learners
Abstract
The Multi-Armed Bandit problem with Unobserved Confounders (MABUC) considers decision-making settings where unmeasured variables can influence both the agent’s decisions and received rewards (Bareinboim et al., 2015). Recent findings showed that unobserved confounders (UCs) pose a unique challenge to algorithms based on standard randomization (i.e., experimental data); if UCs are naively averaged out, these algorithms behave sub-optimally, possibly incurring infinite regret. In this paper, we show how counterfactual-based decision-making circumvents these problems and leads to a coherent fusion of observational and experimental data. We then demonstrate this new strategy in an enhanced Thompson Sampling bandit player, and support our findings’ efficacy with extensive simulations.
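For context, the abstract's "enhanced Thompson Sampling bandit player" builds on standard Beta-Bernoulli Thompson Sampling. The sketch below is a minimal baseline implementation of that standard algorithm, not the paper's counterfactual variant; the paper's contribution is, roughly, to seed and condition such a sampler using fused observational and experimental data rather than starting from uninformed priors. Arm probabilities, round counts, and the seed here are illustrative assumptions.

```python
import random


def thompson_sampling(true_probs, n_rounds, seed=0):
    """Standard Beta-Bernoulli Thompson Sampling (baseline, not the paper's
    counterfactual-enhanced variant). `true_probs` are hypothetical arm
    reward probabilities used only for simulation."""
    rng = random.Random(seed)
    k = len(true_probs)
    # Beta(1, 1) uniform priors; the paper's strategy would instead
    # initialize these counts from observational/experimental data.
    successes = [0] * k
    failures = [0] * k
    total_reward = 0
    for _ in range(n_rounds):
        # Sample a plausible reward rate for each arm from its posterior.
        samples = [rng.betavariate(successes[a] + 1, failures[a] + 1)
                   for a in range(k)]
        arm = max(range(k), key=lambda a: samples[a])
        # Simulate a Bernoulli reward for the chosen arm.
        reward = 1 if rng.random() < true_probs[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    return total_reward


# Example: two arms with reward rates 0.2 and 0.8 (hypothetical values).
total = thompson_sampling([0.2, 0.8], n_rounds=2000)
```

With these settings the sampler quickly concentrates on the better arm, so the average reward approaches 0.8. The paper's point is that when unobserved confounders are present, even this correctly implemented sampler can be led astray unless the data sources are fused counterfactually.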
Cite

Text

Forney et al. "Counterfactual Data-Fusion for Online Reinforcement Learners." International Conference on Machine Learning, 2017.

Markdown

[Forney et al. "Counterfactual Data-Fusion for Online Reinforcement Learners." International Conference on Machine Learning, 2017.](https://mlanthology.org/icml/2017/forney2017icml-counterfactual/)

BibTeX
@inproceedings{forney2017icml-counterfactual,
title = {{Counterfactual Data-Fusion for Online Reinforcement Learners}},
author = {Forney, Andrew and Pearl, Judea and Bareinboim, Elias},
booktitle = {International Conference on Machine Learning},
year = {2017},
pages = {1156-1164},
volume = {70},
url = {https://mlanthology.org/icml/2017/forney2017icml-counterfactual/}
}