Adaptive Exploration-Exploitation Tradeoff for Opportunistic Bandits
Abstract
In this paper, we propose and study opportunistic bandits: a new variant of bandits where the regret of pulling a suboptimal arm varies under different environmental conditions, such as network load or produce price. When the load/price is low, so is the cost/regret of pulling a suboptimal arm (e.g., trying a suboptimal network configuration). Therefore, intuitively, we could explore more when the load/price is low and exploit more when the load/price is high. Inspired by this intuition, we propose an Adaptive Upper-Confidence-Bound (AdaUCB) algorithm to adaptively balance the exploration-exploitation tradeoff for opportunistic bandits. We prove that AdaUCB achieves O(log T) regret with a smaller coefficient than the traditional UCB algorithm. Furthermore, AdaUCB achieves O(1) regret with respect to T if the exploration cost is zero when the load level is below a certain threshold. Finally, based on both synthetic data and real-world traces, experimental results show that AdaUCB significantly outperforms other bandit algorithms, such as UCB and TS (Thompson Sampling), under large load/price fluctuations.
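To make the idea concrete, the sketch below shows one way an adaptive-UCB-style policy could scale its exploration bonus by the observed load level, exploring fully when the load is low and shrinking toward pure exploitation as the load rises. This is a minimal illustration, not the paper's exact AdaUCB rule: the class name `AdaptiveUCB`, the `load_threshold` parameter, and the linear shrinking of the bonus above the threshold are assumptions made here for clarity.

```python
import math
import random


class AdaptiveUCB:
    """Illustrative adaptive-UCB sketch (not the paper's exact AdaUCB rule).

    The exploration bonus is scaled by the current load so that low-load
    rounds explore more and high-load rounds exploit more.
    """

    def __init__(self, n_arms, load_threshold=0.5):
        self.counts = [0] * n_arms      # number of pulls per arm
        self.means = [0.0] * n_arms     # empirical mean reward per arm
        self.t = 0                      # total number of rounds so far
        self.load_threshold = load_threshold

    def select_arm(self, load):
        """Pick an arm given the current load level, assumed normalized to [0, 1]."""
        self.t += 1
        # Pull each arm once before applying the UCB rule.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        # Exploration weight: full exploration below the load threshold,
        # then a linear decay toward pure exploitation (an assumed rule).
        if load <= self.load_threshold:
            weight = 1.0
        else:
            weight = max(0.0, (1.0 - load) / (1.0 - self.load_threshold))
        scores = [
            self.means[arm]
            + weight * math.sqrt(2.0 * math.log(self.t) / self.counts[arm])
            for arm in range(len(self.counts))
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        """Record the observed reward for the pulled arm."""
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


# Toy usage with placeholder loads and rewards.
bandit = AdaptiveUCB(n_arms=3)
for _ in range(1000):
    load = random.random()          # observed load/price level this round
    arm = bandit.select_arm(load)
    reward = random.random()        # environment feedback (placeholder)
    bandit.update(arm, reward)
```

The key design point, following the abstract, is that the exploration term (rather than the reward estimate) depends on the environmental condition, so cheap low-load rounds absorb most of the exploration cost.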
Cite
Text
Wu et al. "Adaptive Exploration-Exploitation Tradeoff for Opportunistic Bandits." International Conference on Machine Learning, 2018.
Markdown
[Wu et al. "Adaptive Exploration-Exploitation Tradeoff for Opportunistic Bandits." International Conference on Machine Learning, 2018.](https://mlanthology.org/icml/2018/wu2018icml-adaptive/)
BibTeX
@inproceedings{wu2018icml-adaptive,
  title     = {{Adaptive Exploration-Exploitation Tradeoff for Opportunistic Bandits}},
  author    = {Wu, Huasen and Guo, Xueying and Liu, Xin},
  booktitle = {International Conference on Machine Learning},
  year      = {2018},
  pages     = {5306-5314},
  volume    = {80},
  url       = {https://mlanthology.org/icml/2018/wu2018icml-adaptive/}
}