Polynomial-Time Reinforcement Learning of Near-Optimal Policies

Abstract

Inspired by recent results on polynomial-time reinforcement learning algorithms that accumulate near-optimal rewards, we look at the related problem of quickly learning near-optimal policies. The new problem is clearly related to the previous one, but differs in important ways. We provide simple algorithms for MDPs, zero-sum and common-payoff stochastic games, and a uniform framework for proving their polynomial complexity. Unlike the previously studied problem, these bounds use the minimum of the mixing time and a new quantity, the spectral radius. Unlike previous results, ours apply uniformly to the average-reward and discounted-reward cases.

Cite

Text

Pivazyan and Shoham. "Polynomial-Time Reinforcement Learning of Near-Optimal Policies." AAAI Conference on Artificial Intelligence, 2002. doi:10.5555/777092.777127

Markdown

[Pivazyan and Shoham. "Polynomial-Time Reinforcement Learning of Near-Optimal Policies." AAAI Conference on Artificial Intelligence, 2002.](https://mlanthology.org/aaai/2002/pivazyan2002aaai-polynomial/) doi:10.5555/777092.777127

BibTeX

@inproceedings{pivazyan2002aaai-polynomial,
  title     = {{Polynomial-Time Reinforcement Learning of Near-Optimal Policies}},
  author    = {Pivazyan, Karèn and Shoham, Yoav},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2002},
  pages     = {205--210},
  doi       = {10.5555/777092.777127},
  url       = {https://mlanthology.org/aaai/2002/pivazyan2002aaai-polynomial/}
}