A Convergent Form of Approximate Policy Iteration

Abstract

We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces ε-soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
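
To make the setup concrete, below is a minimal Python sketch of an algorithm of this shape, not the authors' exact procedure: linear Sarsa for the policy evaluation phase, and an ε-soft softmax mixture as one possible Lipschitz-continuous policy improvement operator. The environment interface (`env.reset`, `env.step`) and the feature map `phi` are hypothetical placeholders.

```python
import numpy as np

def epsilon_soft_softmax(q_values, epsilon=0.1, temperature=1.0):
    """One possible policy improvement operator: a softmax over action values
    mixed with the uniform distribution. Every action receives at least
    epsilon / n probability (epsilon-soft), and for a bounded temperature the
    map from action values to probabilities is Lipschitz continuous."""
    z = (q_values - np.max(q_values)) / temperature  # shift for numerical stability
    soft = np.exp(z) / np.exp(z).sum()
    n = len(q_values)
    return (1.0 - epsilon) * soft + epsilon / n

def sarsa_evaluation(env, phi, w, policy, n_actions,
                     episodes=200, alpha=0.05, gamma=0.99):
    """Policy evaluation with linear Sarsa: Q(s, a) is approximated by w @ phi(s, a).
    `env` and `phi` are assumed interfaces, not part of the paper."""
    for _ in range(episodes):
        s = env.reset()
        a = np.random.choice(n_actions, p=policy(s))
        done = False
        while not done:
            s_next, r, done = env.step(a)
            q_sa = w @ phi(s, a)
            if done:
                target = r
            else:
                a_next = np.random.choice(n_actions, p=policy(s_next))
                target = r + gamma * (w @ phi(s_next, a_next))
            w = w + alpha * (target - q_sa) * phi(s, a)
            if not done:
                s, a = s_next, a_next
    return w

def approximate_policy_iteration(env, phi, n_features, n_actions, iterations=20):
    """Alternate Sarsa evaluation with the epsilon-soft improvement operator."""
    w = np.zeros(n_features)
    for _ in range(iterations):
        w_old = w.copy()

        def policy(s, w_old=w_old):
            # New policy from the improvement operator applied to the
            # action values learned in the previous iteration.
            q = np.array([w_old @ phi(s, b) for b in range(n_actions)])
            return epsilon_soft_softmax(q)

        w = sarsa_evaluation(env, phi, w.copy(), policy, n_actions)
    return w
```

The sketch keeps the policy frozen during each evaluation phase and only changes it through the improvement operator between iterations; the paper's convergence result concerns operators of this kind (ε-soft and Lipschitz in the action values with a sufficiently small constant).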

Cite

Text

Perkins and Precup. "A Convergent Form of Approximate Policy Iteration." Neural Information Processing Systems, 2002.

Markdown

[Perkins and Precup. "A Convergent Form of Approximate Policy Iteration." Neural Information Processing Systems, 2002.](https://mlanthology.org/neurips/2002/perkins2002neurips-convergent/)

BibTeX

@inproceedings{perkins2002neurips-convergent,
  title     = {{A Convergent Form of Approximate Policy Iteration}},
  author    = {Perkins, Theodore J. and Precup, Doina},
  booktitle = {Neural Information Processing Systems},
  year      = {2002},
  pages     = {1627--1634},
  url       = {https://mlanthology.org/neurips/2002/perkins2002neurips-convergent/}
}