A Convergent Form of Approximate Policy Iteration
Abstract
We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces ε-soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
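As a concrete illustration of the setup described in the abstract, the following is a minimal Python sketch of this style of approximate policy iteration: Sarsa(0) with a linear state-action value function for policy evaluation, and an ε-soft policy improvement operator applied to the learned values. The softmax-plus-uniform operator, the env/features interface, and all parameter names are illustrative assumptions, not the paper's exact algorithm or analysis.

```python
import numpy as np


def epsilon_soft_softmax(q_values, epsilon, temperature):
    """Example policy improvement operator: mix a softmax (Boltzmann)
    distribution over the action values with the uniform distribution.
    The result is epsilon-soft (every action has probability >= epsilon / n)
    and, for a fixed temperature, Lipschitz continuous in the action values."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                          # numerical stability
    soft = np.exp(prefs) / np.exp(prefs).sum()
    n = len(q_values)
    return (1.0 - epsilon) * soft + epsilon / n


def approximate_policy_iteration(env, features, n_actions,
                                 n_iterations=50, episodes_per_eval=200,
                                 alpha=0.05, gamma=0.99,
                                 epsilon=0.1, temperature=1.0):
    """Sketch of model-free approximate policy iteration:
    - policy evaluation: Sarsa(0) with a linear state-action value function
    - policy improvement: an epsilon-soft operator applied to the learned values.
    Assumes env.reset() -> state, env.step(a) -> (next_state, reward, done),
    and features(s, a) -> fixed-length numpy feature vector (all hypothetical)."""
    w = np.zeros_like(features(env.reset(), 0), dtype=float)

    def q(weights, s, a):
        return float(np.dot(weights, features(s, a)))

    for _ in range(n_iterations):
        # Freeze the current policy: the improvement operator applied to the
        # Q-values given by the previous iteration's weights.
        w_frozen = w.copy()

        def policy(s):
            qs = [q(w_frozen, s, a) for a in range(n_actions)]
            return epsilon_soft_softmax(qs, epsilon, temperature)

        # Policy evaluation: Sarsa(0) updates while following the frozen policy.
        for _ in range(episodes_per_eval):
            s = env.reset()
            a = np.random.choice(n_actions, p=policy(s))
            done = False
            while not done:
                s_next, r, done = env.step(a)
                a_next = np.random.choice(n_actions, p=policy(s_next))
                target = r + (0.0 if done else gamma * q(w, s_next, a_next))
                w = w + alpha * (target - q(w, s, a)) * features(s, a)
                s, a = s_next, a_next
        # The next iteration's policy is recomputed from the updated weights,
        # i.e. the improvement operator is applied to the new Q-estimates.
    return w
```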
Cite
Text
Perkins and Precup. "A Convergent Form of Approximate Policy Iteration." Neural Information Processing Systems, 2002.
Markdown
[Perkins and Precup. "A Convergent Form of Approximate Policy Iteration." Neural Information Processing Systems, 2002.](https://mlanthology.org/neurips/2002/perkins2002neurips-convergent/)
BibTeX
@inproceedings{perkins2002neurips-convergent,
title = {{A Convergent Form of Approximate Policy Iteration}},
author = {Perkins, Theodore J. and Precup, Doina},
booktitle = {Neural Information Processing Systems},
year = {2002},
pages = {1627--1634},
url = {https://mlanthology.org/neurips/2002/perkins2002neurips-convergent/}
}