Near-Optimal Regret Bounds for Reinforcement Learning

Abstract

For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s,s' there is a policy which moves from s to s' in at most D steps (on average). We present a reinforcement learning algorithm with total regret Õ(DS√AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of Ω(√DSAT) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T. Finally, we also consider a setting where the MDP is allowed to change a fixed number of ℓ times. We present a modification of our algorithm that is able to deal with this setting and show a regret bound of Õ(ℓ^{1/3} T^{2/3} DS√A).
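As a brief formal sketch of the two quantities named in the abstract (not taken verbatim from the paper; the symbols T(s'|M,π,s), ρ*, and r_t are notation assumed here), the diameter and the total regret can be written as:

% Diameter: worst-case expected travel time between any two states, minimized over policies;
% T(s' \mid M, \pi, s) denotes the (random) number of steps needed to reach s'
% when starting in s and following policy \pi in the MDP M.
D(M) \;:=\; \max_{s \neq s'} \; \min_{\pi} \; \mathbb{E}\bigl[\, T(s' \mid M, \pi, s) \,\bigr]

% Total regret after T steps: shortfall of the rewards r_t collected by the learner
% relative to T times the optimal average reward \rho^* of the MDP.
\Delta(T) \;:=\; T\rho^{*} \;-\; \sum_{t=1}^{T} r_t

The main result then bounds Δ(T) by Õ(DS√AT) with high probability, matching the Ω(√DSAT) lower bound up to the factor √DS and logarithmic terms.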

Cite

Text

Jaksch et al. "Near-Optimal Regret Bounds for Reinforcement Learning." Journal of Machine Learning Research, 2010.

Markdown

[Jaksch et al. "Near-Optimal Regret Bounds for Reinforcement Learning." Journal of Machine Learning Research, 2010.](https://mlanthology.org/jmlr/2010/jaksch2010jmlr-nearoptimal/)

BibTeX

@article{jaksch2010jmlr-nearoptimal,
  title     = {{Near-Optimal Regret Bounds for Reinforcement Learning}},
  author    = {Jaksch, Thomas and Ortner, Ronald and Auer, Peter},
  journal   = {Journal of Machine Learning Research},
  year      = {2010},
  pages     = {1563--1600},
  volume    = {11},
  url       = {https://mlanthology.org/jmlr/2010/jaksch2010jmlr-nearoptimal/}
}