Stochastic Bandit Based on Empirical Moments

Abstract

In the multiarmed bandit problem, a gambler chooses an arm of a slot machine to pull while balancing a tradeoff between exploration and exploitation. We study the stochastic bandit problem in which each arm has a reward distribution supported on [0,1]. For this model, there exists a policy that achieves the theoretical bound asymptotically. However, this optimal policy requires solving a convex optimization problem involving the empirical distribution of each arm. In this paper, we propose a policy that exploits the first d empirical moments, for an arbitrary d fixed in advance. We show that the performance of the policy approaches the theoretical bound as d increases. The policy can be implemented by solving polynomial equations, for which we derive explicit solutions when d is smaller than 5. By choosing an appropriate d, the proposed policy realizes a tradeoff between computational complexity and expected regret.
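As a rough illustration of the bookkeeping such a moment-based policy relies on, the sketch below maintains the first d empirical moments of each arm's rewards, updated in O(d) time per pull. The arm-selection rule shown (epsilon-greedy on the empirical mean) is only a placeholder for exposition; it is not the paper's moment-based index, which involves solving the polynomial equations described in the abstract. Class and parameter names are illustrative assumptions, not from the paper.

```python
import random


class MomentBandit:
    """Tracks the first d empirical moments of each arm's rewards in [0, 1].

    Bookkeeping sketch only: the paper's policy feeds these moments into a
    polynomial-equation-based criterion, which is not reproduced here. The
    selection rule below is a placeholder (epsilon-greedy on the mean).
    """

    def __init__(self, n_arms, d, eps=0.1):
        self.d = d
        self.eps = eps
        self.counts = [0] * n_arms
        # moment_sums[i][k-1] accumulates sum of reward**k for arm i, k = 1..d
        self.moment_sums = [[0.0] * d for _ in range(n_arms)]

    def update(self, arm, reward):
        """Record a reward in [0, 1] for the pulled arm."""
        self.counts[arm] += 1
        for k in range(1, self.d + 1):
            self.moment_sums[arm][k - 1] += reward ** k

    def empirical_moments(self, arm):
        """Return the first d empirical moments of the arm's rewards."""
        n = max(self.counts[arm], 1)
        return [s / n for s in self.moment_sums[arm]]

    def select_arm(self):
        # Pull each arm once first, then epsilon-greedy on the first moment.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        if random.random() < self.eps:
            return random.randrange(len(self.counts))
        means = [self.empirical_moments(i)[0] for i in range(len(self.counts))]
        return max(range(len(means)), key=means.__getitem__)
```

For example, with d = 4 one would instantiate `MomentBandit(n_arms=10, d=4)`, call `select_arm()` each round, and pass the observed reward to `update()`; the stored moments are then available to whatever index computation the policy prescribes.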

Cite

Text

Honda and Takemura. "Stochastic Bandit Based on Empirical Moments." Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, 2012.

Markdown

[Honda and Takemura. "Stochastic Bandit Based on Empirical Moments." Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, 2012.](https://mlanthology.org/aistats/2012/honda2012aistats-stochastic/)

BibTeX

@inproceedings{honda2012aistats-stochastic,
  title     = {{Stochastic Bandit Based on Empirical Moments}},
  author    = {Honda, Junya and Takemura, Akimichi},
  booktitle = {Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics},
  year      = {2012},
  pages     = {529-537},
  volume    = {22},
  url       = {https://mlanthology.org/aistats/2012/honda2012aistats-stochastic/}
}