Online Learning in Markov Decision Processes with Changing Cost Sequences

Abstract

In this paper we consider online learning in finite Markov decision processes (MDPs) with changing cost sequences, under both full-information and bandit feedback. We propose to view this problem as an instance of online linear optimization, and present two methods for it: MD^2 (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks. We provide a rigorous complexity analysis of these techniques alongside near-optimal regret bounds; in particular, we account for the computational cost of performing approximate projections in MD^2. In the full-information setting, our results complement existing ones. In the bandit setting, we consider the online stochastic shortest path problem, a special case of the above MDP problems, and improve on existing results by removing the previously required restrictive assumption that the state-visitation probabilities are uniformly bounded away from zero under all policies.
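To give a flavor of the online-linear-optimization view, here is a minimal Python sketch, not the paper's MD^2 algorithm: it runs entropic mirror descent (exponentiated gradient) under full-information feedback on the probability simplex, which stands in for the occupancy-measure polytope over which the paper performs its approximate projections. The function name md_step, the learning rate, and the random cost sequence are all illustrative assumptions.

import numpy as np

def md_step(mu, cost, eta):
    """One entropic mirror descent step (illustrative sketch only).

    mu   : current point on the probability simplex (toy stand-in for
           the occupancy-measure polytope used in the paper)
    cost : linear cost vector revealed for this round (full information)
    eta  : learning rate
    """
    unnormalized = mu * np.exp(-eta * cost)   # unconstrained mirror step
    return unnormalized / unnormalized.sum()  # Bregman projection onto the simplex

# Toy run: 3 actions, a changing cost sequence drawn at random.
rng = np.random.default_rng(0)
mu = np.full(3, 1.0 / 3.0)  # uniform starting point
eta = 0.1
total_cost = 0.0
for t in range(100):
    cost = rng.uniform(0.0, 1.0, size=3)  # cost revealed after playing
    total_cost += mu @ cost               # expected cost incurred this round
    mu = md_step(mu, cost, eta)
print(mu, total_cost)

In the MDP setting of the paper, the simplex projection above is replaced by an approximate Bregman projection onto the polytope of occupancy measures, whose computational cost the authors fold into the regret analysis.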

Cite

Text

Dick et al. "Online Learning in Markov Decision Processes with Changing Cost Sequences." International Conference on Machine Learning, 2014.

Markdown

[Dick et al. "Online Learning in Markov Decision Processes with Changing Cost Sequences." International Conference on Machine Learning, 2014.](https://mlanthology.org/icml/2014/dick2014icml-online/)

BibTeX

@inproceedings{dick2014icml-online,
  title     = {{Online Learning in Markov Decision Processes with Changing Cost Sequences}},
  author    = {Dick, Travis and Gy{\"o}rgy, Andr{\'a}s and Szepesv{\'a}ri, Csaba},
  booktitle = {International Conference on Machine Learning},
  year      = {2014},
  pages     = {512--520},
  volume    = {32},
  url       = {https://mlanthology.org/icml/2014/dick2014icml-online/}
}