Policy Search by Dynamic Programming
Abstract
We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem.
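The abstract's central idea, choosing each step's policy greedily against the baseline distribution while working backwards in time, can be illustrated with a small sketch. The tabular setup below is a simplification, not the paper's exact formulation: it assumes a known transition model `P` and reward `R`, a finite class of candidate deterministic policies, and exact expectations rather than the sampled estimates the paper would use in practice.

```python
import numpy as np

def psdp(P, R, horizon, mu, policy_class):
    """Sketch of dynamic-programming-style policy search with a baseline distribution.

    P            : (S, A, S') transition probabilities (assumed known here)
    R            : (S, A) expected rewards
    horizon      : number of time steps T
    mu           : (T, S) baseline distributions; mu[t][s] is roughly how often
                   a good policy is expected to visit state s at time t
    policy_class : list of deterministic policies, each an (S,) array of actions
    """
    S, A, _ = P.shape
    chosen = [None] * horizon      # nonstationary policy pi_0, ..., pi_{T-1}
    V_next = np.zeros(S)           # value of following the already-chosen tail policies

    # Work backwards: at each step pick the candidate policy that looks best
    # on average under that step's baseline distribution.
    for t in reversed(range(horizon)):
        best_pi, best_score, best_V = None, -np.inf, None
        for pi in policy_class:
            # One-step lookahead: act with pi now, then follow chosen[t+1:].
            Q = R + P @ V_next                  # Q[s, a] = R[s, a] + E[V_next(s')]
            V_pi = Q[np.arange(S), pi]          # value of acting with pi at step t
            score = mu[t] @ V_pi                # averaged under the baseline distribution
            if score > best_score:
                best_pi, best_score, best_V = pi, score, V_pi
        chosen[t] = best_pi
        V_next = best_V
    return chosen
```

Because each of the T steps selects one policy from a fixed class, the procedure terminates after finitely many selections, which matches the finite-termination property claimed in the abstract.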
Cite
Text
Bagnell et al. "Policy Search by Dynamic Programming." Neural Information Processing Systems, 2003.
Markdown
[Bagnell et al. "Policy Search by Dynamic Programming." Neural Information Processing Systems, 2003.](https://mlanthology.org/neurips/2003/bagnell2003neurips-policy/)
BibTeX
@inproceedings{bagnell2003neurips-policy,
  title     = {{Policy Search by Dynamic Programming}},
  author    = {Bagnell, J. A. and Kakade, Sham M. and Schneider, Jeff G. and Ng, Andrew Y.},
  booktitle = {Neural Information Processing Systems},
  year      = {2003},
  pages     = {831--838},
  url       = {https://mlanthology.org/neurips/2003/bagnell2003neurips-policy/}
}