Perturbed-History Exploration in Stochastic Multi-Armed Bandits
Abstract
We propose an online algorithm for cumulative regret minimization in a stochastic multi-armed bandit. The algorithm adds O(t) i.i.d. pseudo-rewards to its history in round t and then pulls the arm with the highest average reward in its perturbed history. Therefore, we call it perturbed-history exploration (PHE). The pseudo-rewards are carefully designed to offset potentially underestimated mean rewards of arms with a high probability. We derive near-optimal gap-dependent and gap-free bounds on the n-round regret of PHE. The key step in our analysis is a novel argument that shows that randomized Bernoulli rewards lead to optimism. Finally, we empirically evaluate PHE and show that it is competitive with state-of-the-art baselines.
Cite
Text
Kveton et al. "Perturbed-History Exploration in Stochastic Multi-Armed Bandits." International Joint Conference on Artificial Intelligence, 2019. doi:10.24963/IJCAI.2019/386Markdown
[Kveton et al. "Perturbed-History Exploration in Stochastic Multi-Armed Bandits." International Joint Conference on Artificial Intelligence, 2019.](https://mlanthology.org/ijcai/2019/kveton2019ijcai-perturbed/) doi:10.24963/IJCAI.2019/386BibTeX
@inproceedings{kveton2019ijcai-perturbed,
title = {{Perturbed-History Exploration in Stochastic Multi-Armed Bandits}},
author = {Kveton, Branislav and Szepesvári, Csaba and Ghavamzadeh, Mohammad and Boutilier, Craig},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2019},
pages = {2786-2793},
doi = {10.24963/IJCAI.2019/386},
url = {https://mlanthology.org/ijcai/2019/kveton2019ijcai-perturbed/}
}