Near-Optimal Reinforcement Learning in Polynomial Time

Abstract

We present new algorithms for reinforcement learning and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we then give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states and actions, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the Exploration-Exploitation trade-off.

Cite

Text

Kearns and Singh. "Near-Optimal Reinforcement Learning in Polynomial Time." Machine Learning, 2002. doi:10.1023/A:1017984413808

Markdown

[Kearns and Singh. "Near-Optimal Reinforcement Learning in Polynomial Time." Machine Learning, 2002.](https://mlanthology.org/mlj/2002/kearns2002mlj-nearoptimal/) doi:10.1023/A:1017984413808

BibTeX

@article{kearns2002mlj-nearoptimal,
  title     = {{Near-Optimal Reinforcement Learning in Polynomial Time}},
  author    = {Kearns, Michael J. and Singh, Satinder},
  journal   = {Machine Learning},
  year      = {2002},
  pages     = {209-232},
  doi       = {10.1023/A:1017984413808},
  volume    = {49},
  url       = {https://mlanthology.org/mlj/2002/kearns2002mlj-nearoptimal/}
}