A Natural Policy Gradient

Abstract

We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris.
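
For context, the natural gradient discussed in the abstract premultiplies the ordinary policy gradient by the inverse Fisher information matrix of the policy. A minimal sketch of the update in standard notation (the symbols below follow common usage for this method and are a paraphrase, not text copied from the paper):

$$
\tilde{\nabla}\eta(\theta) = F(\theta)^{-1}\,\nabla\eta(\theta),
\qquad
F(\theta) = \mathbb{E}_{s\sim\rho^{\pi},\, a\sim\pi(\cdot\mid s;\theta)}
\left[\nabla_{\theta}\log\pi(a\mid s;\theta)\,\nabla_{\theta}\log\pi(a\mid s;\theta)^{\top}\right],
$$

with the parameter update $\theta \leftarrow \theta + \alpha\,\tilde{\nabla}\eta(\theta)$ for some step size $\alpha$, where $\eta(\theta)$ denotes the policy's expected return and $\rho^{\pi}$ its state distribution.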

Cite

Text

Kakade. "A Natural Policy Gradient." Neural Information Processing Systems, 2001.

Markdown

[Kakade. "A Natural Policy Gradient." Neural Information Processing Systems, 2001.](https://mlanthology.org/neurips/2001/kakade2001neurips-natural/)

BibTeX

@inproceedings{kakade2001neurips-natural,
  title     = {{A Natural Policy Gradient}},
  author    = {Kakade, Sham M.},
  booktitle = {Neural Information Processing Systems},
  year      = {2001},
  pages     = {1531-1538},
  url       = {https://mlanthology.org/neurips/2001/kakade2001neurips-natural/}
}