Policy Gradient Methods for Reinforcement Learning with Function Approximation
Abstract
Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams's REINFORCE method and actor-critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
Cite
Text
Sutton et al. "Policy Gradient Methods for Reinforcement Learning with Function Approximation." Neural Information Processing Systems, 1999.
Markdown
[Sutton et al. "Policy Gradient Methods for Reinforcement Learning with Function Approximation." Neural Information Processing Systems, 1999.](https://mlanthology.org/neurips/1999/sutton1999neurips-policy/)
BibTeX
@inproceedings{sutton1999neurips-policy,
title = {{Policy Gradient Methods for Reinforcement Learning with Function Approximation}},
author = {Sutton, Richard S. and McAllester, David A. and Singh, Satinder P. and Mansour, Yishay},
booktitle = {Neural Information Processing Systems},
year = {1999},
pages = {1057-1063},
url = {https://mlanthology.org/neurips/1999/sutton1999neurips-policy/}
}