A Natural Policy Gradient
Abstract
We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris.
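The direction the abstract refers to is the ordinary policy gradient premultiplied by the inverse of the policy's Fisher information matrix. Below is a minimal, illustrative sketch (not taken from the paper's experiments) for a softmax policy in a single-state MDP; the function names, step size, and damping constant are assumptions made for the example.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def natural_gradient_step(theta, rewards, lr=0.5, damping=1e-8):
    """One natural policy gradient step for a softmax policy over actions."""
    pi = softmax(theta)
    eta = pi @ rewards                     # expected reward under the current policy
    grad = pi * (rewards - eta)            # plain policy gradient d(eta)/d(theta)
    # Fisher information of the softmax policy: F = diag(pi) - pi pi^T
    F = np.diag(pi) - np.outer(pi, pi)
    # F is singular (logits are shift-invariant), so solve with a small damping term
    nat_grad = np.linalg.solve(F + damping * np.eye(len(pi)), grad)
    return theta + lr * nat_grad

theta = np.zeros(3)
rewards = np.array([1.0, 0.0, 0.5])
for _ in range(50):
    theta = natural_gradient_step(theta, rewards)
print(softmax(theta))  # mass concentrates on the highest-reward (greedy) action
```

In this toy case the undamped natural-gradient direction is the reward vector up to an additive constant, so it keeps pointing at the greedy action even as the policy concentrates, whereas the plain gradient shrinks toward zero; this is the "greedy optimal action rather than just a better action" behavior the abstract describes.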
Cite
Kakade. "A Natural Policy Gradient." Neural Information Processing Systems, 2001. https://mlanthology.org/neurips/2001/kakade2001neurips-natural/
BibTeX
@inproceedings{kakade2001neurips-natural,
title = {{A Natural Policy Gradient}},
author = {Kakade, Sham M.},
booktitle = {Neural Information Processing Systems},
year = {2001},
pages = {1531-1538},
url = {https://mlanthology.org/neurips/2001/kakade2001neurips-natural/}
}