Reinforcement Learning in a Birth and Death Process: Breaking the Dependence on the State Space
Abstract
In this paper, we revisit the regret of undiscounted reinforcement learning in MDPs with a birth and death structure. Specifically, we consider a controlled queue with impatient jobs, where the main objective is to optimize a trade-off between energy consumption and user-perceived performance. Within this setting, the diameter $D$ of the MDP is $\Omega(S^S)$, where $S$ is the number of states. Therefore, the existing lower and upper bounds on the regret at time $T$, of order $\sqrt{DSAT}$ for MDPs with $S$ states and $A$ actions, may suggest that reinforcement learning is inefficient here. In our main result, however, we exploit the structure of our MDPs to show that the regret of a slightly tweaked version of the classical learning algorithm UCRL2 is in fact upper bounded by $\tilde{\mathcal{O}}(\sqrt{E_2 A T})$, where $E_2$ is a weighted second moment of the stationary measure of a reference policy. Importantly, $E_2$ is bounded independently of $S$. Thus, our bound is asymptotically independent of the number of states and of the diameter. This result is based on a careful study of the number of visits performed by the learning algorithm to the states of the MDP, which is highly non-uniform across states.
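To make the setting concrete, below is a minimal sketch of the kind of birth and death chain the abstract describes: a single queue with impatient jobs under a fixed reference policy, whose stationary measure concentrates near empty states. All names and values here (`lam`, `mu`, `theta`, `S`) are illustrative assumptions, not the paper's parameters, and the exact weighting defining $E_2$ in the paper may differ from the plain second moment computed here.

```python
import numpy as np

# Hypothetical parameters for a controlled queue with impatient jobs
# (illustrative values, not taken from the paper).
S = 50        # buffer size: states are queue lengths 0, 1, ..., S
lam = 1.0     # job arrival rate (birth rate, constant in the state)
mu = 1.5      # service rate under a fixed reference policy
theta = 0.2   # per-job impatience (abandonment) rate

# Birth and death rates of the chain induced by the reference policy.
# Impatience makes the death rate grow linearly with the queue length.
birth = np.full(S, lam)                                       # x -> x + 1
death = np.array([mu + theta * x for x in range(1, S + 1)])   # x -> x - 1

# Stationary distribution via the birth-death detailed-balance recursion:
# pi(x + 1) = pi(x) * birth(x) / death(x + 1).
pi = np.ones(S + 1)
for x in range(S):
    pi[x + 1] = pi[x] * birth[x] / death[x]
pi /= pi.sum()

# Second moment of the stationary measure: a stand-in for the E_2
# quantity in the abstract (the paper's weighting may differ).
E2 = np.sum(pi * np.arange(S + 1) ** 2)
print(f"stationary second moment ~ {E2:.3f}")
```

Because the death rates grow linearly while the birth rate stays constant, the tail of `pi` decays faster than geometrically, so the computed second moment converges to a finite limit as `S` grows. This concentration of the stationary measure (and hence of the learner's visits) on small states is the structural fact behind a regret bound that does not scale with $S$ or $D$.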
Cite

Text
Anselmi et al. "Reinforcement Learning in a Birth and Death Process: Breaking the Dependence on the State Space." Neural Information Processing Systems, 2022.

Markdown
[Anselmi et al. "Reinforcement Learning in a Birth and Death Process: Breaking the Dependence on the State Space." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/anselmi2022neurips-reinforcement/)

BibTeX
@inproceedings{anselmi2022neurips-reinforcement,
  title     = {{Reinforcement Learning in a Birth and Death Process: Breaking the Dependence on the State Space}},
  author    = {Anselmi, Jonatha and Gaujal, Bruno and Rebuffi, Louis-Sébastien},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/anselmi2022neurips-reinforcement/}
}