Online Learning in Markov Decision Processes with Continuous Actions

Abstract

We consider the problem of online learning in a Markov decision process (MDP) with finite states but continuous actions. This generalizes both the traditional problem of learning an MDP with finite states and actions, and the so-called continuum-armed bandit problem, which has continuous actions but no state. Building on previous work for these two problems, we propose a new algorithm for our setting, which dynamically discretizes the action space and learns to play strategies over these discretized actions that evolve over time. Our algorithm achieves a $T$-step regret on the order of $T^{\frac{d+1}{d+2}}$ with high probability, where $d$ is a near-optimality dimension we introduce to capture the hardness of learning the MDP.
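To make the discretization idea concrete, below is a minimal sketch of the simplest fixed-discretization baseline for the stateless (continuum-armed bandit) special case: uniformly grid the action space $[0,1]$ into a finite set of arms and run UCB1 over them. This is only an illustrative baseline, not the paper's algorithm, which additionally handles state transitions and refines the discretization dynamically; the function names and the choice of reward model here are ours.

```python
import math
import random

def ucb_discretized(reward_fn, num_arms=10, horizon=1000, seed=0):
    """UCB1 over a uniform discretization of the action space [0, 1].

    Illustrative continuum-armed-bandit baseline only: the paper's
    algorithm also tracks MDP state dynamics and adapts the grid.
    """
    rng = random.Random(seed)
    arms = [(i + 0.5) / num_arms for i in range(num_arms)]  # cell centers
    counts = [0] * num_arms
    means = [0.0] * num_arms
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= num_arms:
            i = t - 1  # initialization: play each arm once
        else:
            # pick the arm maximizing empirical mean + exploration bonus
            i = max(range(num_arms),
                    key=lambda j: means[j] + math.sqrt(2 * math.log(t) / counts[j]))
        r = reward_fn(arms[i]) + rng.gauss(0, 0.1)  # noisy reward observation
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]  # incremental mean update
        total_reward += r
    return total_reward, arms, counts
```

With a fixed grid, the regret trades off discretization error (fewer arms means a coarser approximation of the best continuous action) against exploration cost (more arms take longer to try); tuning the grid size to the horizon is what yields bounds of the $T^{\frac{d+1}{d+2}}$ type, and the paper's dynamic discretization avoids having to fix the grid in advance.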

Cite

Text

Hong and Lu. "Online Learning in Markov Decision Processes with Continuous Actions." International Conference on Algorithmic Learning Theory, 2015. doi:10.1007/978-3-319-24486-0_20

Markdown

[Hong and Lu. "Online Learning in Markov Decision Processes with Continuous Actions." International Conference on Algorithmic Learning Theory, 2015.](https://mlanthology.org/alt/2015/hong2015alt-online/) doi:10.1007/978-3-319-24486-0_20

BibTeX

@inproceedings{hong2015alt-online,
  title     = {{Online Learning in Markov Decision Processes with Continuous Actions}},
  author    = {Hong, Yi-Te and Lu, Chi-Jen},
  booktitle = {International Conference on Algorithmic Learning Theory},
  year      = {2015},
  pages     = {302--316},
  doi       = {10.1007/978-3-319-24486-0_20},
  url       = {https://mlanthology.org/alt/2015/hong2015alt-online/}
}