Learning Algorithms for Markovian Bandits: Is Posterior Sampling More Scalable than Optimism?
Abstract
In this paper, we study the scalability of model-based algorithms learning the optimal policy of a discounted rested Markovian bandit problem with $n$ arms. There are two categories of model-based reinforcement learning algorithms: Bayesian algorithms (like PSRL) and optimistic algorithms (like UCRL2 or UCBVI). A naive application of these algorithms is not scalable because the state space is exponential in $n$. In this paper, we construct variants of these algorithms specially tailored to Markovian bandits (MB) that we call MB-PSRL, MB-UCRL2, and MB-UCBVI. We consider an episodic setting with geometrically distributed episode lengths, and we measure the performance of each algorithm in terms of regret (Bayesian regret for MB-PSRL and expected regret for MB-UCRL2 and MB-UCBVI). We prove that, in this setting, all three algorithms have a low regret in $\tilde{O}(S\sqrt{nK})$ -- where $K$ is the number of episodes, $n$ is the number of arms and $S$ is the number of states of each arm. Up to a factor $\sqrt{S}$, these regrets match the Bayesian minimax regret lower bound of $\Omega(\sqrt{SnK})$ that we also derive. Even though their theoretical regrets are comparable, the {\it time complexities} of these algorithms vary greatly: we show that MB-UCRL2, as well as any algorithm that uses bonuses on transition matrices, has a time complexity that grows exponentially in $n$. In contrast, MB-UCBVI does not use bonuses on transition matrices, and we show that it can be implemented efficiently, with a time complexity linear in $n$. Our numerical experiments show, however, that its empirical regret is large. Our Bayesian algorithm, MB-PSRL, enjoys the best of both worlds: its running time is linear in the number of arms and its empirical regret is the smallest of all three algorithms. This is a new addition to our understanding of the power of Bayesian algorithms, which can often be tailored to the structure of the problem to be learned.
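To make the episodic setting concrete, below is a minimal, self-contained Python sketch of a posterior-sampling loop in the spirit of MB-PSRL. It assumes independent Dirichlet posteriors over each arm's transition rows and Beta posteriors over Bernoulli reward means; each episode samples one model per arm, computes the Gittins indices of the sampled model via the restart-in-state characterization, and plays the arm with the largest index until the episode ends with probability $1-\beta$ at every step. The sizes, priors, and Bernoulli-reward assumption are illustrative choices, not the authors' exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, S, beta, K = 3, 4, 0.95, 100   # arms, states per arm, discount, episodes (toy sizes)

# Hidden true model: one rested Markov chain per arm, Bernoulli rewards (assumed here).
P_true = rng.dirichlet(np.ones(S), size=(n, S))   # P_true[i, s] is a transition row
r_true = rng.uniform(size=(n, S))                 # mean reward of (arm, state)

def gittins_indices(P, r, beta, iters=300):
    """Gittins index of every state of one arm, via the restart-in-state
    formulation: the index of state s is (1 - beta) * V_s(s), where V_s is the
    value of an MDP that may either continue the chain or restart to state s."""
    idx = np.empty(len(r))
    for s in range(len(r)):
        V = np.zeros(len(r))
        for _ in range(iters):
            V = np.maximum(r + beta * P @ V, r[s] + beta * P[s] @ V)
        idx[s] = (1 - beta) * V[s]
    return idx

# Posteriors: Dirichlet counts per transition row, Beta counts per mean reward.
dir_cnt = np.ones((n, S, S))
beta_cnt = np.ones((n, S, 2))                     # (successes + 1, failures + 1)

state = np.zeros(n, dtype=int)                    # current state of every arm
for episode in range(K):
    # 1. Sample one model per arm from the posterior.
    P_hat = np.array([[rng.dirichlet(dir_cnt[i, s]) for s in range(S)] for i in range(n)])
    r_hat = rng.beta(beta_cnt[..., 0], beta_cnt[..., 1])
    # 2. The optimal policy of the sampled model plays the largest Gittins index.
    G = np.array([gittins_indices(P_hat[i], r_hat[i], beta) for i in range(n)])
    # 3. Run one episode of geometrically distributed length (parameter 1 - beta).
    while True:
        i = int(np.argmax(G[np.arange(n), state]))        # arm with the largest index
        s = state[i]
        rew = int(rng.random() < r_true[i, s])            # Bernoulli reward
        s_next = rng.choice(S, p=P_true[i, s])            # only the played arm moves
        beta_cnt[i, s] += (rew, 1 - rew)                  # update reward posterior
        dir_cnt[i, s, s_next] += 1                        # update transition posterior
        state[i] = s_next
        if rng.random() < 1 - beta:                       # episode ends
            break
```

The per-arm structure is what keeps the loop linear in $n$: posteriors, sampling, and index computations factor across arms, so the exponential joint state space is never enumerated.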
Cite
Text
Gast et al. "Learning Algorithms for Markovian Bandits: Is Posterior Sampling More Scalable than Optimism?." Transactions on Machine Learning Research, 2022.
Markdown
[Gast et al. "Learning Algorithms for Markovian Bandits: Is Posterior Sampling More Scalable than Optimism?." Transactions on Machine Learning Research, 2022.](https://mlanthology.org/tmlr/2022/gast2022tmlr-learning/)
BibTeX
@article{gast2022tmlr-learning,
title = {{Learning Algorithms for Markovian Bandits:\\Is Posterior Sampling More Scalable than Optimism?}},
author = {Gast, Nicolas and Gaujal, Bruno and Khun, Kimang},
journal = {Transactions on Machine Learning Research},
year = {2022},
url = {https://mlanthology.org/tmlr/2022/gast2022tmlr-learning/}
}