Thompson Sampling Algorithms for Mean-Variance Bandits
Abstract
The multi-armed bandit (MAB) problem is a classical learning task that exemplifies the exploration-exploitation tradeoff. However, standard formulations do not take risk into account, even though risk is a primary concern in online decision-making systems; in this regard, the mean-variance risk measure is one of the most commonly used objective functions. Existing algorithms for mean-variance optimization in MAB problems make unrealistic assumptions about the reward distributions. We develop Thompson Sampling-style algorithms for mean-variance MABs and provide comprehensive regret analyses for Gaussian and Bernoulli bandits under fewer assumptions. Our algorithms achieve the best known regret bounds for mean-variance MABs and also attain the information-theoretic bounds in some parameter regimes. Empirical simulations show that our algorithms significantly outperform existing LCB-based algorithms for all risk tolerances.
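To make the setting concrete, below is a minimal sketch of a Thompson Sampling-style strategy for a mean-variance Gaussian bandit. It assumes the common formulation in which arm i has mean-variance value MV_i = ρ·μ_i − σ_i² (ρ being the risk tolerance, higher MV_i being better) and maintains a conjugate Normal-Inverse-Gamma posterior over each arm's unknown mean and variance. The callback `bandit_pull` is a hypothetical stand-in for the environment; this is an illustrative sketch under those assumptions, not a reproduction of the paper's exact algorithms or their regret-optimal parameter choices.

```python
import numpy as np

def mv_thompson_sampling(bandit_pull, n_arms, horizon, rho=1.0, rng=None):
    """Sketch of Thompson Sampling for a mean-variance Gaussian bandit.

    Assumes the objective MV_i = rho * mu_i - sigma_i^2 (maximized) and a
    Normal-Inverse-Gamma posterior per arm.  Illustrative only; not the
    paper's exact algorithm.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Normal-Inverse-Gamma hyperparameters per arm: (mu0, kappa, alpha, beta).
    mu0 = np.zeros(n_arms)
    kappa = np.ones(n_arms)
    alpha = np.ones(n_arms)
    beta = np.ones(n_arms)
    rewards = []

    for _ in range(horizon):
        # Sample a variance and a mean for each arm from its posterior.
        var_sample = 1.0 / rng.gamma(alpha, 1.0 / beta)            # Inverse-Gamma draw
        mean_sample = rng.normal(mu0, np.sqrt(var_sample / kappa))
        # Play the arm with the best sampled mean-variance value.
        arm = int(np.argmax(rho * mean_sample - var_sample))

        r = bandit_pull(arm)
        rewards.append(r)

        # Conjugate Normal-Inverse-Gamma update for the pulled arm.
        mu_new = (kappa[arm] * mu0[arm] + r) / (kappa[arm] + 1.0)
        beta[arm] += 0.5 * kappa[arm] * (r - mu0[arm]) ** 2 / (kappa[arm] + 1.0)
        mu0[arm] = mu_new
        kappa[arm] += 1.0
        alpha[arm] += 0.5

    return rewards
```

As a usage example, `bandit_pull` could be a synthetic Gaussian environment such as `lambda i: np.random.default_rng().normal(true_means[i], true_stds[i])`. Sampling both the mean and the variance from the posterior lets the exploration reflect uncertainty in the variance estimate as well as in the mean, which is the key difference from Thompson Sampling for the standard (risk-neutral) MAB.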
Cite
Text
Zhu and Tan. "Thompson Sampling Algorithms for Mean-Variance Bandits." International Conference on Machine Learning, 2020.Markdown
[Zhu and Tan. "Thompson Sampling Algorithms for Mean-Variance Bandits." International Conference on Machine Learning, 2020.](https://mlanthology.org/icml/2020/zhu2020icml-thompson/)BibTeX
@inproceedings{zhu2020icml-thompson,
  title     = {{Thompson Sampling Algorithms for Mean-Variance Bandits}},
  author    = {Zhu, Qiuyu and Tan, Vincent},
  booktitle = {International Conference on Machine Learning},
  year      = {2020},
  pages     = {11599--11608},
  volume    = {119},
  url       = {https://mlanthology.org/icml/2020/zhu2020icml-thompson/}
}