Policy Search by Dynamic Programming

Abstract

We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem.
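A minimal sketch of the idea described in the abstract, under stated assumptions: with a horizon T and a baseline distribution over states for each time step, a non-stationary policy is built backwards, one step per iteration, so the procedure terminates after exactly T steps. The helper names (`sample_baseline_state`, `rollout_value`, `policy_class`) are hypothetical placeholders for illustration, not the authors' code.

```python
def psdp(T, policy_class, sample_baseline_state, rollout_value, n_samples=100):
    """Return a non-stationary policy (pi_0, ..., pi_{T-1}) chosen backwards in time."""
    policies = [None] * T
    # Work backwards: at step t the later policies pi_{t+1}, ..., pi_{T-1} are already fixed.
    for t in reversed(range(T)):
        best_pi, best_value = None, float("-inf")
        for pi in policy_class:
            # Monte-Carlo estimate of the value of acting with `pi` at time t
            # (and with the already-chosen policies afterwards), starting from
            # states drawn from the baseline distribution for time t.
            total = 0.0
            for _ in range(n_samples):
                s = sample_baseline_state(t)
                total += rollout_value(s, t, pi, policies[t + 1:])
            value = total / n_samples
            if value > best_value:
                best_pi, best_value = pi, value
        policies[t] = best_pi
    return policies
```

Because each step only requires maximizing an expectation over the baseline distribution, the loop runs exactly T times, which is the sense in which the search terminates in a finite number of steps.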

Cite

Text

Bagnell et al. "Policy Search by Dynamic Programming." Neural Information Processing Systems, 2003.

Markdown

[Bagnell et al. "Policy Search by Dynamic Programming." Neural Information Processing Systems, 2003.](https://mlanthology.org/neurips/2003/bagnell2003neurips-policy/)

BibTeX

@inproceedings{bagnell2003neurips-policy,
  title     = {{Policy Search by Dynamic Programming}},
  author    = {Bagnell, J. A. and Kakade, Sham M. and Schneider, Jeff G. and Ng, Andrew Y.},
  booktitle = {Neural Information Processing Systems},
  year      = {2003},
  pages     = {831--838},
  url       = {https://mlanthology.org/neurips/2003/bagnell2003neurips-policy/}
}