Near-Optimal Reinforcement Learning in Polynomial Time
Abstract
We present new algorithms for reinforcement learning and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we then give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states and actions, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the Exploration-Exploitation trade-off.
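The abstract highlights the algorithms' explicit handling of the exploration-exploitation trade-off. The paper's approach rests on distinguishing "known" states (where every action has been sampled enough times to estimate the model accurately) from unknown ones, and on trying actions evenly in unknown states. The sketch below illustrates only that exploration mechanic; the class name, the `m_known` threshold parameter, and the `exploit_policy` hook are illustrative assumptions, not the paper's actual interface.

```python
from collections import defaultdict

class BalancedWanderer:
    """Sketch of the exploration side of a Kearns-Singh-style learner.

    A state is treated as 'known' once every action has been tried at
    least m_known times; in unknown states the least-tried action is
    taken ('balanced wandering'). Names and thresholds are illustrative.
    """

    def __init__(self, n_actions, m_known):
        self.n_actions = n_actions
        self.m_known = m_known
        # counts[state][action] = number of times that action was tried there
        self.counts = defaultdict(lambda: [0] * n_actions)

    def is_known(self, state):
        # Known once every action has m_known samples in this state.
        return min(self.counts[state]) >= self.m_known

    def choose_action(self, state, exploit_policy=None):
        if self.is_known(state) and exploit_policy is not None:
            # Known state: defer to a policy computed on the learned model.
            action = exploit_policy(state)
        else:
            # Balanced wandering: take the least-tried action.
            action = min(range(self.n_actions),
                         key=lambda a: self.counts[state][a])
        self.counts[state][action] += 1
        return action
```

Because each visit to an unknown state increments its least-tried action's count, every action in a state reaches the `m_known` threshold after at most `n_actions * m_known` visits, which is what makes the exploration cost polynomial.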
Cite
Text
Kearns and Singh. "Near-Optimal Reinforcement Learning in Polynomial Time." Machine Learning, 2002. doi:10.1023/A:1017984413808
Markdown
[Kearns and Singh. "Near-Optimal Reinforcement Learning in Polynomial Time." Machine Learning, 2002.](https://mlanthology.org/mlj/2002/kearns2002mlj-nearoptimal/) doi:10.1023/A:1017984413808
BibTeX
@article{kearns2002mlj-nearoptimal,
title = {{Near-Optimal Reinforcement Learning in Polynomial Time}},
author = {Kearns, Michael J. and Singh, Satinder},
journal = {Machine Learning},
year = {2002},
pages = {209-232},
doi = {10.1023/A:1017984413808},
volume = {49},
url = {https://mlanthology.org/mlj/2002/kearns2002mlj-nearoptimal/}
}