On the Convergence of Optimistic Policy Iteration

Abstract

We consider a finite-state Markov decision problem and establish the convergence of a special case of optimistic policy iteration that involves Monte Carlo estimation of Q-values, in conjunction with greedy policy selection. We provide convergence results for a number of algorithmic variations, including one that involves temporal difference learning (bootstrapping) instead of Monte Carlo estimation. We also indicate some extensions that either fail or are unlikely to go through.
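As a rough illustration of the setting the abstract describes, here is a minimal sketch of optimistic policy iteration on a toy finite-state MDP: each iteration blends a single Monte Carlo return into the Q-value estimates with a decaying step size (a partial, "optimistic" evaluation rather than a full one), then selects the greedy policy. The toy MDP, the synchronous sweep over all state-action pairs, the step-size schedule, and all names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, gamma = 4, 2, 0.9
# Random toy MDP: P[s, a] is a distribution over next states,
# R[s, a] is the expected immediate reward.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.standard_normal((n_states, n_actions))

def mc_return(s, a, policy, horizon=30):
    """One simulated trajectory's discounted return from (s, a),
    taking action a first and following `policy` thereafter."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        g += discount * R[s, a]
        discount *= gamma
        s = rng.choice(n_states, p=P[s, a])
        a = policy[s]
    return g

Q = np.zeros((n_states, n_actions))
policy = np.zeros(n_states, dtype=int)

for k in range(1, 501):
    # Optimistic (partial) evaluation: a single Monte Carlo sample
    # per state-action pair, mixed in with a decaying step size,
    # instead of evaluating the current policy to convergence.
    alpha = 1.0 / k
    for s in range(n_states):
        for a in range(n_actions):
            Q[s, a] += alpha * (mc_return(s, a, policy) - Q[s, a])
    # Greedy policy selection with respect to the current Q estimates.
    policy = Q.argmax(axis=1)

print("greedy policy:", policy)
```

The temporal difference (bootstrapping) variant mentioned in the abstract would replace the full-trajectory return in `mc_return` with a bootstrapped one-step target of the form `R[s, a] + gamma * Q[s_next, policy[s_next]]`; the paper studies which such variations retain convergence guarantees.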

Cite

Text

Tsitsiklis. "On the Convergence of Optimistic Policy Iteration." Journal of Machine Learning Research, 2002.

Markdown

[Tsitsiklis. "On the Convergence of Optimistic Policy Iteration." Journal of Machine Learning Research, 2002.](https://mlanthology.org/jmlr/2002/tsitsiklis2002jmlr-convergence/)

BibTeX

@article{tsitsiklis2002jmlr-convergence,
  title     = {{On the Convergence of Optimistic Policy Iteration}},
  author    = {Tsitsiklis, John N.},
  journal   = {Journal of Machine Learning Research},
  year      = {2002},
  pages     = {59--72},
  volume    = {3},
  url       = {https://mlanthology.org/jmlr/2002/tsitsiklis2002jmlr-convergence/}
}