Compatible Natural Gradient Policy Search

Abstract

Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to use KL-divergence to bound the region of trust, resulting in a natural gradient policy update. We show that the natural gradient and trust region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule, leading to premature convergence. To control entropy reduction, we introduce a new policy search method called compatible policy search (COPOS), which bounds entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks and in discrete partially observable tasks.
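
For context, a minimal sketch of the kind of constrained update the abstract refers to: a trust-region policy objective with a KL bound, plus an additional bound on entropy loss. The notation below (policy \pi_\theta, advantage estimate \hat{A}, bounds \epsilon and \beta) is standard notation assumed here for illustration and is not copied from the paper.

\[
\max_{\theta}\; \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\,\hat{A}(s,a)\right]
\quad \text{s.t.} \quad
\mathbb{E}_{s}\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\|\,\pi_{\theta}(\cdot \mid s)\big)\right] \le \epsilon,
\qquad
\mathbb{E}_{s}\!\left[H\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\big) - H\big(\pi_{\theta}(\cdot \mid s)\big)\right] \le \beta.
\]

As the abstract states, the KL-constrained problem coincides with a natural gradient step when the policy uses the natural parameterization of a standard exponential distribution together with compatible value function approximation; the second constraint reflects the entropy-loss bound that COPOS adds to prevent premature convergence.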

Cite

Text

Pajarinen et al. "Compatible Natural Gradient Policy Search." Machine Learning, 2019. doi:10.1007/s10994-019-05807-0

Markdown

[Pajarinen et al. "Compatible Natural Gradient Policy Search." Machine Learning, 2019.](https://mlanthology.org/mlj/2019/pajarinen2019mlj-compatible/) doi:10.1007/s10994-019-05807-0

BibTeX

@article{pajarinen2019mlj-compatible,
  title     = {{Compatible Natural Gradient Policy Search}},
  author    = {Pajarinen, Joni and Thai, Hong Linh and Akrour, Riad and Peters, Jan and Neumann, Gerhard},
  journal   = {Machine Learning},
  year      = {2019},
  pages     = {1443--1466},
  doi       = {10.1007/s10994-019-05807-0},
  volume    = {108},
  url       = {https://mlanthology.org/mlj/2019/pajarinen2019mlj-compatible/}
}