Generalized Exploration in Policy Search

Abstract

To learn control policies in unknown environments, learning agents need to explore by trying actions deemed suboptimal. In prior work, such exploration is performed either by perturbing the actions at each time step independently, or by perturbing the policy parameters over an entire episode. Since both of these strategies have certain advantages, a more balanced trade-off could be beneficial. We introduce a unifying view on step-based and episode-based exploration that allows for such balanced trade-offs. This trade-off strategy can be used with various reinforcement learning algorithms. In this paper, we study this generalized exploration strategy in a policy gradient method and in relative entropy policy search. We evaluate the exploration strategy on four dynamical systems and compare the results to the established step-based and episode-based exploration strategies. Our results show that a more balanced trade-off can yield faster learning and better final policies, and illustrate some of the effects that cause these performance differences.
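To give intuition for the trade-off the abstract describes, here is a minimal sketch of temporally correlated parameter noise. It is an illustrative assumption (an AR(1)-style blend with a hypothetical mixing coefficient `beta`), not the paper's exact formulation: `beta = 1` recovers step-based exploration (fresh noise every step), `beta = 0` recovers episode-based exploration (one perturbation held fixed for the episode), and intermediate values interpolate between the two.

```python
import numpy as np


def perturbation_sequence(dim, horizon, beta, sigma=0.1, rng=None):
    """Sample a sequence of policy-parameter perturbations.

    beta = 0.0: episode-based exploration (one perturbation for the
    whole episode). beta = 1.0: step-based exploration (independent
    perturbation each step). 0 < beta < 1: temporally correlated noise.
    This AR(1)-style interpolation is an illustrative sketch, not the
    paper's exact generalized-exploration formulation.
    """
    rng = np.random.default_rng(rng)
    eps = rng.normal(0.0, sigma, size=dim)  # initial perturbation
    seq = [eps.copy()]
    for _ in range(horizon - 1):
        fresh = rng.normal(0.0, sigma, size=dim)
        # Blend old and fresh noise; the weights are chosen so the
        # marginal variance stays sigma**2 at every time step.
        eps = np.sqrt(1.0 - beta) * eps + np.sqrt(beta) * fresh
        seq.append(eps.copy())
    return np.array(seq)
```

At each step the perturbation would be added to the nominal policy parameters before computing the action, so the two established strategies fall out as the endpoints of a single noise process.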

Cite

Text

van Hoof et al. "Generalized Exploration in Policy Search." Machine Learning, 2017. doi:10.1007/s10994-017-5657-1

Markdown

[van Hoof et al. "Generalized Exploration in Policy Search." Machine Learning, 2017.](https://mlanthology.org/mlj/2017/vanhoof2017mlj-generalized/) doi:10.1007/s10994-017-5657-1

BibTeX

@article{vanhoof2017mlj-generalized,
  title     = {{Generalized Exploration in Policy Search}},
  author    = {van Hoof, Herke and Tanneberg, Daniel and Peters, Jan},
  journal   = {Machine Learning},
  year      = {2017},
  pages     = {1705--1724},
  doi       = {10.1007/s10994-017-5657-1},
  volume    = {106},
  url       = {https://mlanthology.org/mlj/2017/vanhoof2017mlj-generalized/}
}