Maximum a Posteriori Policy Optimisation

Abstract

We introduce a new algorithm for reinforcement learning called Maximum a Posteriori Policy Optimisation (MPO), based on coordinate ascent on a relative-entropy objective. We show that several existing methods can be directly related to our derivation. We develop two off-policy algorithms and demonstrate that they are competitive with the state of the art in deep reinforcement learning. In particular, for continuous control, our method outperforms existing methods in sample efficiency, resistance to premature convergence, and robustness to hyperparameter settings.
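
A minimal sketch of the coordinate-ascent (EM-style) view referred to in the abstract: an E-step builds a non-parametric variational distribution q by re-weighting the current policy with exponentiated action values under a relative-entropy bound, and an M-step fits the parametric policy to q by weighted maximum likelihood under a KL trust region (the maximum a posteriori step). Here Q denotes the action-value function, \eta a temperature enforcing the relative-entropy bound, and \epsilon the trust-region radius; the exact constrained objectives and their dual problems are derived in the paper.

  % E-step: tilt the current policy towards high-value actions,
  % subject to a relative-entropy (KL) bound controlled by the temperature \eta
  q(a \mid s) \;\propto\; \pi_{\mathrm{old}}(a \mid s)\, \exp\!\big( Q(s, a) / \eta \big)

  % M-step: weighted maximum-likelihood fit of the parametric policy to q,
  % regularised by a KL trust region on the policy update
  \pi_{\mathrm{new}} \;=\; \arg\max_{\pi}\; \mathbb{E}_{s,\, a \sim q}\big[ \log \pi(a \mid s) \big]
  \quad \text{s.t.} \quad \mathbb{E}_{s}\big[ \mathrm{KL}\big( \pi_{\mathrm{old}}(\cdot \mid s) \,\|\, \pi(\cdot \mid s) \big) \big] \le \epsilon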

Cite

Text

Abdolmaleki et al. "Maximum a Posteriori Policy Optimisation." International Conference on Learning Representations, 2018.

Markdown

[Abdolmaleki et al. "Maximum a Posteriori Policy Optimisation." International Conference on Learning Representations, 2018.](https://mlanthology.org/iclr/2018/abdolmaleki2018iclr-maximum/)

BibTeX

@inproceedings{abdolmaleki2018iclr-maximum,
  title     = {{Maximum a Posteriori Policy Optimisation}},
  author    = {Abdolmaleki, Abbas and Springenberg, Jost Tobias and Tassa, Yuval and Munos, Remi and Heess, Nicolas and Riedmiller, Martin},
  booktitle = {International Conference on Learning Representations},
  year      = {2018},
  url       = {https://mlanthology.org/iclr/2018/abdolmaleki2018iclr-maximum/}
}