Logarithmic Regret in Communicating MDPs: Leveraging Known Dynamics with Bandits

Abstract

We study regret minimization in an average-reward, communicating Markov Decision Process (MDP) with known dynamics but an unknown reward function. Although learning in such MDPs is a priori easier than in fully unknown ones, it remains challenging, since these MDPs include as special cases large classes of problems such as combinatorial semi-bandits. Leveraging knowledge of the transition function for regret minimization in a statistically efficient way appears largely unexplored. Since achieving exact optimality in generic MDPs is conjectured to be NP-hard, even with known transitions, we focus on a computationally efficient relaxation, at the cost of achieving order-optimal logarithmic regret instead of exact optimality. We contribute to filling this gap by introducing a novel algorithm based on the popular Indexed Minimum Empirical Divergence (IMED) strategy for bandits. A key component of the proposed algorithm is a carefully designed stopping criterion leveraging the recurrent classes induced by stationary policies. We derive a non-asymptotic, problem-dependent, logarithmic regret bound for this algorithm, which relies on a novel regret decomposition that leverages this structure. We further provide an efficient implementation and experiments illustrating its promising empirical performance.
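For context, the classical bandit-level IMED strategy of Honda and Takemura, on which the proposed algorithm builds, plays at each round the arm minimizing an empirical-divergence index. A minimal sketch for Bernoulli rewards (the paper's extension to communicating MDPs, including the stopping criterion over recurrent classes, is detailed in the full text):

I_a(t) = N_a(t) \, \mathrm{kl}\big(\hat{\mu}_a(t), \hat{\mu}^*(t)\big) + \log N_a(t), \qquad a_{t+1} \in \arg\min_a I_a(t),

where N_a(t) is the number of pulls of arm a up to time t, \hat{\mu}_a(t) its empirical mean reward, \hat{\mu}^*(t) the largest empirical mean, and \mathrm{kl}(p, q) = p \log\frac{p}{q} + (1-p) \log\frac{1-p}{1-q} the Bernoulli Kullback-Leibler divergence.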

Cite

Text

Saber et al. "Logarithmic Regret in Communicating MDPs: Leveraging Known Dynamics with Bandits." Proceedings of the 15th Asian Conference on Machine Learning, 2023.

Markdown

[Saber et al. "Logarithmic Regret in Communicating MDPs: Leveraging Known Dynamics with Bandits." Proceedings of the 15th Asian Conference on Machine Learning, 2023.](https://mlanthology.org/acml/2023/saber2023acml-logarithmic/)

BibTeX

@inproceedings{saber2023acml-logarithmic,
  title     = {{Logarithmic Regret in Communicating MDPs: Leveraging Known Dynamics with Bandits}},
  author    = {Saber, Hassan and Pesquerel, Fabien and Maillard, Odalric-Ambrym and Talebi, Mohammad Sadegh},
  booktitle = {Proceedings of the 15th Asian Conference on Machine Learning},
  year      = {2023},
  pages     = {1167--1182},
  volume    = {222},
  url       = {https://mlanthology.org/acml/2023/saber2023acml-logarithmic/}
}