Rotting Infinitely Many-Armed Bandits
Abstract
We consider the infinitely many-armed bandit problem with rotting rewards, in which the mean reward of an arm decreases with each pull according to an arbitrary trend whose maximum rotting rate is $\varrho=o(1)$. We show that this learning problem has a worst-case regret lower bound of $\Omega(\max\{\varrho^{1/3}T, \sqrt{T}\})$, where $T$ is the time horizon. When the algorithm knows the maximum rotting rate $\varrho$, a matching upper bound of $\tilde{O}(\max\{\varrho^{1/3}T, \sqrt{T}\})$, up to a poly-logarithmic factor, is achieved by an algorithm that maintains a UCB index for each arm and uses a threshold value to decide whether to keep pulling an arm or remove it from further consideration. We also show that an algorithm that does not know the value of $\varrho$ achieves an $\tilde{O}(\max\{\varrho^{1/3}T, T^{3/4}\})$ regret upper bound by using an adaptive UCB index together with an adaptive threshold value.
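As a rough illustration of the threshold-based policy described above, the following is a minimal simulation sketch. This is our own simplified reading, not the paper's algorithm: the reservoir distribution (uniform initial means), the rotting trend (a constant decay of $\varrho$ per pull), the confidence bonus, and the threshold scale $\varrho^{1/3}$ are all illustrative assumptions.

```python
import math
import random

def run_ucb_threshold(T, varrho, seed=0):
    """Simulate a threshold-based UCB policy for rotting bandits.

    Illustrative assumptions (not from the paper): rewards lie in [0, 1],
    each fresh arm's initial mean is Uniform(0, 1), and every pull lowers
    that arm's mean by exactly varrho, the maximum rotting rate.
    Returns the total reward collected over the horizon T.
    """
    rng = random.Random(seed)
    eps = varrho ** (1.0 / 3.0)          # threshold scale suggested by the regret bounds
    total, t = 0.0, 0
    while t < T:
        mu = rng.random()                # sample a fresh arm from the reservoir
        pulls, est = 0, 0.0
        while t < T:
            reward = max(0.0, min(1.0, mu + rng.uniform(-0.05, 0.05)))
            mu = max(0.0, mu - varrho)   # rotting: the mean decays with each pull
            t += 1
            pulls += 1
            est += (reward - est) / pulls      # running mean of observed rewards
            total += reward
            ucb = est + math.sqrt(2.0 * math.log(T) / pulls)
            if ucb < 1.0 - eps:          # index fell below the threshold: retire the arm
                break
    return total
```

Calling `run_ucb_threshold(T=10_000, varrho=1e-4)` runs one trajectory; the policy keeps an arm only while its optimistic index stays near the best possible mean reward, which is the mechanism the abstract attributes to the known-$\varrho$ algorithm.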
Cite
Text
Kim et al. "Rotting Infinitely Many-Armed Bandits." International Conference on Machine Learning, 2022.
Markdown
[Kim et al. "Rotting Infinitely Many-Armed Bandits." International Conference on Machine Learning, 2022.](https://mlanthology.org/icml/2022/kim2022icml-rotting/)
BibTeX
@inproceedings{kim2022icml-rotting,
  title     = {{Rotting Infinitely Many-Armed Bandits}},
  author    = {Kim, Jung-Hun and Vojnovic, Milan and Yun, Se-Young},
  booktitle = {International Conference on Machine Learning},
  year      = {2022},
  pages     = {11229--11254},
  volume    = {162},
  url       = {https://mlanthology.org/icml/2022/kim2022icml-rotting/}
}