Learning from Logged Implicit Exploration Data

Abstract

We provide a sound and consistent foundation for the use of *nonrandom* exploration data in "contextual bandit" or "partially labeled" settings, where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, under which the "offline" data was logged, is not explicitly known. Prior solutions require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from an Internet advertising company.
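To make the setting concrete, here is a minimal sketch of offline policy evaluation when the logging policy is unknown: the action propensities are estimated from frequencies in the log itself, and importance weights are clipped to control variance. All names (`estimate_propensities`, `clipped_ips_value`, the threshold `tau`) are illustrative, not the paper's notation, and the frequency-based propensity estimate stands in for whatever model one fits in practice.

```python
from collections import Counter, defaultdict

def estimate_propensities(log):
    """Estimate P(action | context) empirically from the logged data,
    since the exploration policy that generated the log is not known.
    `log` is a list of (context, action, reward) tuples with discrete
    contexts; a real system would fit a model instead."""
    counts = defaultdict(Counter)
    for context, action, _reward in log:
        counts[context][action] += 1
    propensities = {}
    for context, action_counts in counts.items():
        total = sum(action_counts.values())
        propensities[context] = {a: n / total for a, n in action_counts.items()}
    return propensities

def clipped_ips_value(log, policy, propensities, tau=0.05):
    """Inverse-propensity-scored estimate of the value of `policy`,
    with each propensity floored at `tau` (clipping the importance
    weight at 1/tau), trading a little bias for lower variance."""
    total = 0.0
    for context, action, reward in log:
        if policy(context) == action:
            p = max(propensities[context].get(action, 0.0), tau)
            total += reward / p
    return total / len(log)
```

Usage: with a toy log `[(0, 'a', 1.0), (0, 'a', 1.0), (0, 'b', 0.0), (1, 'b', 1.0)]` and a deterministic policy mapping context 0 to `'a'` and context 1 to `'b'`, the estimator reweights the matching logged events by their estimated propensities.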

Cite

Text

Strehl et al. "Learning from Logged Implicit Exploration Data." Neural Information Processing Systems, 2010.

Markdown

[Strehl et al. "Learning from Logged Implicit Exploration Data." Neural Information Processing Systems, 2010.](https://mlanthology.org/neurips/2010/strehl2010neurips-learning/)

BibTeX

@inproceedings{strehl2010neurips-learning,
  title     = {{Learning from Logged Implicit Exploration Data}},
  author    = {Strehl, Alex and Langford, John and Li, Lihong and Kakade, Sham M.},
  booktitle = {Neural Information Processing Systems},
  year      = {2010},
  pages     = {2217--2225},
  url       = {https://mlanthology.org/neurips/2010/strehl2010neurips-learning/}
}