Online Learning in Markov Decision Processes with Continuous Actions
Abstract
We consider the problem of online learning in a Markov decision process (MDP) with finite states but continuous actions. This generalizes both the traditional problem of learning an MDP with finite states and actions, and the so-called continuum-armed bandit problem, which has continuous actions but no state. Building on previous work for these two problems, we propose a new algorithm that dynamically discretizes the action space and learns to play strategies over these discretized actions that evolve over time. Our algorithm achieves a $T$-step regret of order roughly $T^{\frac{d+1}{d+2}}$ with high probability, where $d$ is a near-optimality dimension we introduce to capture the hardness of learning the MDP.
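The exponent $\frac{d+1}{d+2}$ in the regret bound interpolates between the easy and hard regimes: at $d = 0$ it gives the familiar $\sqrt{T}$ rate, and it approaches linear regret as $d \to \infty$. A minimal sketch of this arithmetic (the function name is ours, for illustration only, not from the paper):

```python
# Illustrative only: how the exponent of T in the regret bound
# T^{(d+1)/(d+2)} grows with the near-optimality dimension d.
def regret_exponent(d):
    """Exponent of T in the high-probability regret bound."""
    return (d + 1) / (d + 2)

# d = 0 recovers the sqrt(T) rate; larger d pushes the rate toward linear.
for d in range(5):
    print(d, round(regret_exponent(d), 3))
```

As the printed values show, the exponent increases monotonically in $d$ but always stays strictly below 1, so the per-step regret still vanishes as $T$ grows.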
Cite
Hong and Lu. "Online Learning in Markov Decision Processes with Continuous Actions." International Conference on Algorithmic Learning Theory, 2015. doi:10.1007/978-3-319-24486-0_20
BibTeX
@inproceedings{hong2015alt-online,
title = {{Online Learning in Markov Decision Processes with Continuous Actions}},
author = {Hong, Yi-Te and Lu, Chi-Jen},
booktitle = {International Conference on Algorithmic Learning Theory},
year = {2015},
pages = {302--316},
doi = {10.1007/978-3-319-24486-0_20},
url = {https://mlanthology.org/alt/2015/hong2015alt-online/}
}