Thompson Sampling Algorithms for Mean-Variance Bandits

Abstract

The multi-armed bandit (MAB) problem is a classical learning task that exemplifies the exploration-exploitation tradeoff. However, standard formulations do not take risk into account, and in online decision-making systems risk is a primary concern. In this regard, the mean-variance risk measure is one of the most common objective functions. Existing algorithms for mean-variance optimization in the context of MAB problems make unrealistic assumptions about the reward distributions. We develop Thompson Sampling-style algorithms for mean-variance MAB and provide comprehensive regret analyses for Gaussian and Bernoulli bandits with fewer assumptions. Our algorithms achieve the best known regret bounds for mean-variance MABs and also attain the information-theoretic bounds in some parameter regimes. Empirical simulations show that our algorithms significantly outperform existing LCB-based algorithms for all risk tolerances.
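To make the setting concrete, below is a minimal, hypothetical sketch of Thompson Sampling for a mean-variance Gaussian bandit. It is not the paper's exact algorithm: it assumes the mean-variance objective MV = ρ·μ − σ² with risk tolerance ρ, an assumed Gamma posterior on each arm's precision and a Gaussian posterior on its mean, and it simply pulls the arm whose posterior-sampled mean-variance is largest.

```python
import numpy as np

def mv_thompson_sampling(arms, horizon, rho=1.0, rng=None):
    """Hypothetical sketch of Thompson Sampling for mean-variance Gaussian bandits.

    `arms` is a list of callables, each returning one reward sample.
    An arm's mean-variance is taken to be MV = rho * mean - variance,
    where rho >= 0 is the risk tolerance (an assumed convention).
    """
    rng = np.random.default_rng() if rng is None else rng
    k = len(arms)
    counts = np.zeros(k, dtype=int)
    means = np.zeros(k)   # running empirical means
    m2 = np.zeros(k)      # running sums of squared deviations (Welford)
    history = []

    for t in range(horizon):
        if t < 2 * k:
            # Pull each arm twice so empirical variances are defined.
            i = t % k
        else:
            sampled_mv = np.empty(k)
            for a in range(k):
                # Sample a precision from an assumed Gamma posterior on 1/variance,
                # then a mean from a Gaussian posterior given that precision.
                tau = rng.gamma(shape=counts[a] / 2.0,
                                scale=2.0 / max(m2[a], 1e-12))
                mu = rng.normal(means[a], 1.0 / np.sqrt(counts[a] * tau))
                sampled_mv[a] = rho * mu - 1.0 / tau
            i = int(np.argmax(sampled_mv))

        r = arms[i]()
        counts[i] += 1
        delta = r - means[i]
        means[i] += delta / counts[i]
        m2[i] += delta * (r - means[i])
        history.append((i, r))
    return history


# Example: two arms where the higher-mean arm is also riskier.
arms = [lambda: np.random.normal(0.5, 1.0), lambda: np.random.normal(0.6, 2.0)]
history = mv_thompson_sampling(arms, horizon=2000, rho=1.0)
```

The key design choice illustrated here is that randomness enters through posterior samples of both the mean and the variance, rather than through a deterministic lower confidence bound as in LCB-based baselines.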

Cite

Text

Zhu and Tan. "Thompson Sampling Algorithms for Mean-Variance Bandits." International Conference on Machine Learning, 2020.

Markdown

[Zhu and Tan. "Thompson Sampling Algorithms for Mean-Variance Bandits." International Conference on Machine Learning, 2020.](https://mlanthology.org/icml/2020/zhu2020icml-thompson/)

BibTeX

@inproceedings{zhu2020icml-thompson,
  title     = {{Thompson Sampling Algorithms for Mean-Variance Bandits}},
  author    = {Zhu, Qiuyu and Tan, Vincent},
  booktitle = {International Conference on Machine Learning},
  year      = {2020},
  pages     = {11599--11608},
  volume    = {119},
  url       = {https://mlanthology.org/icml/2020/zhu2020icml-thompson/}
}