Profitable Bandits
Abstract
Originally motivated by default risk management applications, this paper investigates a novel problem, referred to here as the \emph{profitable bandit problem}. At each step, an agent chooses a subset of the $K\geq 1$ possible actions. For each action chosen, she then pays the sum of a random number of costs and receives the sum of a random number of rewards. Her objective is to maximize her cumulative profit. For this purpose, we adapt and study three well-known strategies that have proved most efficient in other settings: \textsc{kl-UCB}, \textsc{Bayes-UCB} and \textsc{Thompson Sampling}. For each of them, we prove a finite-time regret bound which, together with a lower bound we also obtain, establishes asymptotic optimality in some cases. Our goal is also to \emph{compare} these three strategies from both a theoretical and an empirical perspective. We give simple, self-contained proofs that emphasize their similarities as well as their differences. While both Bayesian strategies automatically adapt to the geometry of information, the numerical experiments carried out show a slight advantage for \textsc{Thompson Sampling} in practice.
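To illustrate the flavor of the setting, here is a minimal Thompson-Sampling-style sketch for a simplified profitable-bandit scenario. It assumes a Bernoulli reward per pull with unknown mean and a known fixed cost per pull, and plays an arm in a round only when its sampled expected profit is positive; the function name `thompson_profitable`, this reward/cost model, and the toy parameters are illustrative assumptions, not the paper's exact algorithm, which handles random numbers of costs and rewards per action.

```python
import numpy as np

def thompson_profitable(reward_draw, costs, horizon, rng=None):
    # Illustrative sketch (assumed model): Bernoulli rewards with unknown
    # means, known per-pull costs.  An arm is played in a round iff its
    # Beta-posterior sample exceeds its cost, i.e. the sampled expected
    # profit of pulling it is positive.
    rng = np.random.default_rng(rng)
    K = len(costs)
    successes = np.zeros(K)   # Beta(1, 1) prior on each unknown mean
    failures = np.zeros(K)
    profit = 0.0
    for _ in range(horizon):
        theta = rng.beta(successes + 1, failures + 1)  # posterior samples
        played = np.where(theta > costs)[0]            # profitable under the sample
        for k in played:
            r = reward_draw(k)                         # observe a Bernoulli reward
            successes[k] += r
            failures[k] += 1 - r
            profit += r - costs[k]
    return profit

# Toy usage: three arms with true means 0.2, 0.5, 0.8 and a uniform cost of
# 0.4, so only the last two arms are worth playing in the long run.
true_means = np.array([0.2, 0.5, 0.8])
costs = np.array([0.4, 0.4, 0.4])
env_rng = np.random.default_rng(0)
draw = lambda k: env_rng.binomial(1, true_means[k])
print(thompson_profitable(draw, costs, horizon=5000, rng=1))
```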
Cite
Text
Achab et al. "Profitable Bandits." Proceedings of The 10th Asian Conference on Machine Learning, 2018.
Markdown
[Achab et al. "Profitable Bandits." Proceedings of The 10th Asian Conference on Machine Learning, 2018.](https://mlanthology.org/acml/2018/achab2018acml-profitable/)
BibTeX
@inproceedings{achab2018acml-profitable,
title = {{Profitable Bandits}},
author = {Achab, Mastane and Clémençon, Stephan and Garivier, Aurélien},
booktitle = {Proceedings of The 10th Asian Conference on Machine Learning},
year = {2018},
pages = {694-709},
volume = {95},
url = {https://mlanthology.org/acml/2018/achab2018acml-profitable/}
}