Policy Gradient Methods for Reinforcement Learning with Function Approximation

Abstract

Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams's REINFORCE method and actor-critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
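For context, the gradient referred to in the abstract is the paper's policy gradient theorem. A sketch in the paper's notation (with $\rho$ the performance measure, $d^{\pi}$ the state distribution induced by the parameterized policy $\pi$, and $Q^{\pi}$ its action-value function):

\[
  \frac{\partial \rho}{\partial \theta}
    = \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(s,a)}{\partial \theta}\, Q^{\pi}(s,a)
\]

Because the right-hand side is an expectation over states and actions visited under $\pi$, it can be estimated from sampled experience; the paper further shows that $Q^{\pi}$ may be replaced by a "compatible" learned approximation without biasing the gradient direction, which is what underlies the convergence result.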

Cite

Text

Sutton et al. "Policy Gradient Methods for Reinforcement Learning with Function Approximation." Neural Information Processing Systems, 1999.

Markdown

[Sutton et al. "Policy Gradient Methods for Reinforcement Learning with Function Approximation." Neural Information Processing Systems, 1999.](https://mlanthology.org/neurips/1999/sutton1999neurips-policy/)

BibTeX

@inproceedings{sutton1999neurips-policy,
  title     = {{Policy Gradient Methods for Reinforcement Learning with Function Approximation}},
  author    = {Sutton, Richard S. and McAllester, David A. and Singh, Satinder P. and Mansour, Yishay},
  booktitle = {Neural Information Processing Systems},
  year      = {1999},
  pages     = {1057-1063},
  url       = {https://mlanthology.org/neurips/1999/sutton1999neurips-policy/}
}