Near-Optimal Reinforcement Learning in Factored MDPs
Abstract
Any reinforcement learning algorithm that applies to all Markov decision processes (MDPs) will suffer $\Omega(\sqrt{SAT})$ regret on some MDP, where $T$ is the elapsed time and $S$ and $A$ are the cardinalities of the state and action spaces. This implies $T = \Omega(SA)$ time to guarantee a near-optimal policy. In many settings of practical interest, due to the curse of dimensionality, $S$ and $A$ can be so enormous that this learning time is unacceptable. We establish that, if the system is known to be a \emph{factored} MDP, it is possible to achieve regret that scales polynomially in the number of \emph{parameters} encoding the factored MDP, which may be exponentially smaller than $S$ or $A$. We provide two algorithms that satisfy near-optimal regret bounds in this context: posterior sampling reinforcement learning (PSRL) and an upper confidence bound algorithm (UCRL-Factored).
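To make the factored-MDP idea concrete, here is a minimal illustrative sketch (not the authors' code) of posterior sampling for a factored MDP: each next-state factor is assumed to depend only on a small "scope" of the current state-action factors, and each factor gets its own Dirichlet posterior, so the number of learned parameters scales with the factored encoding rather than with $S \times A$. Names such as `factor_sizes`, `scopes`, and `num_scope_values` are hypothetical placeholders.

```python
# Illustrative sketch, assuming a factored transition model with independent
# Dirichlet posteriors per factor; this is not the paper's implementation.
import numpy as np

class FactoredPSRL:
    def __init__(self, factor_sizes, scopes, num_scope_values, prior=1.0):
        # factor_sizes[i]: number of values of next-state factor i.
        # scopes[i]: indices of (state, action) factors that factor i depends on.
        # num_scope_values[i]: number of joint values of that scope.
        # One Dirichlet posterior per (factor, scope value) pair, so the
        # parameter count grows with the factored encoding, not |S| * |A|.
        self.factor_sizes = factor_sizes
        self.scopes = scopes
        self.counts = [prior * np.ones((num_scope_values[i], factor_sizes[i]))
                       for i in range(len(factor_sizes))]

    def sample_mdp(self):
        # Draw one transition distribution per factor and scope value from
        # the current Dirichlet posterior (the PSRL "sampling" step).
        return [np.array([np.random.dirichlet(row) for row in c])
                for c in self.counts]

    def update(self, factor, scope_value, next_value):
        # Conjugate count update after observing one factor transition.
        self.counts[factor][scope_value, next_value] += 1.0
```

In a PSRL episode loop, `sample_mdp()` would be called once per episode, a policy would be computed that is optimal for the sampled factored model, and `update` would be applied to every observed factor transition before the next episode; the UCRL-Factored variant instead maintains confidence sets over the same per-factor parameters.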
Cite
Text
Osband and Van Roy. "Near-Optimal Reinforcement Learning in Factored MDPs." Neural Information Processing Systems, 2014.
Markdown
[Osband and Van Roy. "Near-Optimal Reinforcement Learning in Factored MDPs." Neural Information Processing Systems, 2014.](https://mlanthology.org/neurips/2014/osband2014neurips-nearoptimal/)
BibTeX
@inproceedings{osband2014neurips-nearoptimal,
  title = {{Near-Optimal Reinforcement Learning in Factored MDPs}},
  author = {Osband, Ian and Van Roy, Benjamin},
  booktitle = {Neural Information Processing Systems},
  year = {2014},
  pages = {604-612},
  url = {https://mlanthology.org/neurips/2014/osband2014neurips-nearoptimal/}
}