Variational Thompson Sampling for Relational Recurrent Bandits

Abstract

In this paper, we introduce a novel non-stationary bandit setting, called the relational recurrent bandit, where the rewards of arms at successive time steps are interdependent. The aim is to discover temporal and structural dependencies between arms in order to maximize the cumulative collected reward. Two algorithms are proposed: the first directly models temporal dependencies between arms, while the second assumes the existence of hidden states of the system behind the observed rewards. For both approaches, we develop a Variational Thompson Sampling method, which approximates distributions via variational inference and uses the estimated distributions to sample reward expectations at each iteration of the process. Experiments conducted on both synthetic and real data demonstrate the effectiveness of our approaches.
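The paper's specific relational models are not detailed in this abstract, but the general Thompson Sampling pattern it builds on can be sketched. Below is a minimal, illustrative Python sketch of Thompson Sampling with a factorized Gaussian approximate posterior over per-arm mean rewards: sample a reward expectation for each arm from the (approximate) posterior, play the argmax, and update the played arm's posterior. All names and the Gaussian setup are assumptions for illustration, not the paper's algorithm; with independent Gaussian arms the "variational" posterior coincides with the exact conjugate posterior, and the variational machinery only becomes necessary once dependencies between arms break conjugacy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: K arms with unknown Gaussian mean rewards, unit noise.
K = 5
true_means = rng.normal(0.0, 1.0, size=K)

# Factorized Gaussian approximate posterior q(mu_k) = N(m_k, s2_k) per arm,
# initialized to the prior N(0, 1).
m = np.zeros(K)   # posterior means
s2 = np.ones(K)   # posterior variances

T = 2000
for t in range(T):
    # Thompson step: sample one reward expectation per arm from q,
    # then play the arm whose sample is largest.
    sampled = rng.normal(m, np.sqrt(s2))
    a = int(np.argmax(sampled))
    r = true_means[a] + rng.normal()  # observe a noisy reward

    # Closed-form Gaussian update of q for the played arm
    # (observation noise variance fixed to 1).
    s2_new = 1.0 / (1.0 / s2[a] + 1.0)
    m[a] = s2_new * (m[a] / s2[a] + r)
    s2[a] = s2_new

print(m.round(2), s2.round(4))
```

In the relational recurrent setting described above, the independent per-arm update would be replaced by a variational update over the coupled temporal/structural model, but the sample-then-argmax decision step stays the same.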

Cite

Text

Lamprier et al. "Variational Thompson Sampling for Relational Recurrent Bandits." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2017. doi:10.1007/978-3-319-71246-8_25

Markdown

[Lamprier et al. "Variational Thompson Sampling for Relational Recurrent Bandits." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2017.](https://mlanthology.org/ecmlpkdd/2017/lamprier2017ecmlpkdd-variational/) doi:10.1007/978-3-319-71246-8_25

BibTeX

@inproceedings{lamprier2017ecmlpkdd-variational,
  title     = {{Variational Thompson Sampling for Relational Recurrent Bandits}},
  author    = {Lamprier, Sylvain and Gisselbrecht, Thibault and Gallinari, Patrick},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2017},
  pages     = {405--421},
  doi       = {10.1007/978-3-319-71246-8_25},
  url       = {https://mlanthology.org/ecmlpkdd/2017/lamprier2017ecmlpkdd-variational/}
}