PG-TS: Improved Thompson Sampling for Logistic Contextual Bandits

Abstract

We address the problem of regret minimization in logistic contextual bandits, where a learner decides among sequential actions or arms given their respective contexts to maximize binary rewards. Using a fast inference procedure with Pólya-Gamma distributed augmentation variables, we propose an improved version of Thompson Sampling, a Bayesian formulation of contextual bandits with near-optimal performance. Our approach, Pólya-Gamma augmented Thompson Sampling (PG-TS), achieves state-of-the-art performance on simulated and real data. PG-TS explores the action space efficiently and exploits high-reward arms, quickly converging to solutions of low regret. Its explicit estimation of the posterior distribution of the context feature covariance leads to substantial empirical gains over approximate approaches. PG-TS is the first approach to demonstrate the benefits of Pólya-Gamma augmentation in bandits and to propose an efficient Gibbs sampler for approximating the analytically unsolvable integral of logistic contextual bandits.
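To make the mechanics concrete, below is a minimal Python sketch of one plausible PG-TS round, built from the standard Pólya-Gamma data-augmentation conditionals for Bayesian logistic regression (Polson, Scott and Windle, 2013). It is not the authors' reference implementation: the function names, the N(0, I) prior, the series truncation level K, and the number of Gibbs sweeps are illustrative assumptions.

```python
import numpy as np


def sample_pg(b, c, rng, K=200):
    """Approximate draw from PG(b, c) via the infinite sum-of-gammas
    representation of Polson, Scott & Windle (2013):
      PG(b, c) = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),
    with g_k ~ Gamma(b, 1) i.i.d.; truncated here at K terms (an assumption)."""
    k = np.arange(1, K + 1)
    g = rng.gamma(b, 1.0, size=K)
    return (g / ((k - 0.5) ** 2 + (c / (2.0 * np.pi)) ** 2)).sum() / (2.0 * np.pi ** 2)


def gibbs_posterior_draw(X, y, b0, B0_inv, theta, rng, n_sweeps=50):
    """Draw theta from the logistic-regression posterior by Gibbs sampling.
    Augmentation makes both conditionals tractable:
      omega_i | theta  ~ PG(1, x_i . theta)
      theta | omega, y ~ N(m, V),  V = (X' Omega X + B0^-1)^-1,
                         m = V (X' kappa + B0^-1 b0),  kappa_i = y_i - 1/2."""
    kappa = y - 0.5
    for _ in range(n_sweeps):
        psi = X @ theta
        omega = np.array([sample_pg(1.0, c, rng) for c in psi])
        V = np.linalg.inv(X.T @ (omega[:, None] * X) + B0_inv)
        m = V @ (X.T @ kappa + B0_inv @ b0)
        theta = rng.multivariate_normal(m, V)
    return theta


def run_pg_ts(get_contexts, pull_arm, d, T, rng):
    """Thompson Sampling loop: sample theta from the posterior, then play the
    arm with the highest sampled score (sigmoid is monotone, so maximizing
    the linear score x . theta maximizes the sampled expected reward)."""
    b0, B0_inv = np.zeros(d), np.eye(d)          # N(0, I) prior on theta
    X_hist, y_hist = [], []
    theta = rng.multivariate_normal(b0, np.eye(d))
    for t in range(T):
        if X_hist:                                # after the first observation
            theta = gibbs_posterior_draw(np.array(X_hist), np.array(y_hist),
                                         b0, B0_inv, theta, rng)
        contexts = get_contexts(t)                # (n_arms, d) context array
        arm = int(np.argmax(contexts @ theta))    # act greedily on the sample
        reward = pull_arm(t, arm)                 # binary reward in {0, 1}
        X_hist.append(contexts[arm])
        y_hist.append(reward)
    return X_hist, y_hist
```

With `rng = np.random.default_rng(0)`, a `get_contexts` callback returning an (n_arms, d) array, and a `pull_arm` callback returning a Bernoulli reward, the loop runs as written. A production implementation would likely replace the truncated-series sampler with an exact Pólya-Gamma sampler such as the pypolyagamma package.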

Cite

Text

Dumitrascu et al. "PG-TS: Improved Thompson Sampling for Logistic Contextual Bandits." Neural Information Processing Systems, 2018.

Markdown

[Dumitrascu et al. "PG-TS: Improved Thompson Sampling for Logistic Contextual Bandits." Neural Information Processing Systems, 2018.](https://mlanthology.org/neurips/2018/dumitrascu2018neurips-pgts/)

BibTeX

@inproceedings{dumitrascu2018neurips-pgts,
  title     = {{PG-TS: Improved Thompson Sampling for Logistic Contextual Bandits}},
  author    = {Dumitrascu, Bianca and Feng, Karen and Engelhardt, Barbara},
  booktitle = {Neural Information Processing Systems},
  year      = {2018},
  pages     = {4624--4633},
  url       = {https://mlanthology.org/neurips/2018/dumitrascu2018neurips-pgts/}
}