Discovered Policy Optimisation

Abstract

The last decade has been revolutionary for reinforcement learning (RL): it can now solve complex decision and control problems. However, successful RL methods have so far been handcrafted using mathematical derivation, intuition, and experimentation. This approach has a major shortcoming: it yields specific solutions to the RL problem rather than a protocol for discovering efficient and robust methods. In contrast, the emerging field of meta-learning provides a toolkit for automatic machine learning method optimisation, potentially addressing this flaw; yet black-box approaches that attempt to discover RL algorithms with minimal prior structure have thus far not been successful. Mirror Learning, a framework that includes RL algorithms such as PPO, offers a potential middle ground: every algorithm in this space comes with convergence guarantees, while the "drift" function that distinguishes one algorithm from another is left as a design choice. In this paper we explore the Mirror Learning space by meta-learning the drift function; we refer to the result as Learnt Policy Optimisation (LPO). By analysing LPO we gain original insights into policy optimisation, which we use to formulate a novel, closed-form RL algorithm, Discovered Policy Optimisation (DPO). Our experiments in Brax environments confirm state-of-the-art performance of LPO and DPO, as well as their transfer to unseen settings.
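
To make the idea concrete, below is a minimal JAX sketch of a Mirror-Learning-style surrogate in which PPO's clipping penalty is replaced by a learned drift over the probability ratio and the advantage. The network architecture, the input features, and the (ratio - 1)^2 factor used to enforce the drift conditions (non-negativity and a zero value and gradient at ratio = 1) are illustrative assumptions, not the authors' exact parameterisation; in the paper the drift parameters are meta-optimised over entire RL training runs in Brax.

import jax
import jax.numpy as jnp

def init_drift_params(key, hidden=16):
    # Small MLP over (ratio - 1, advantage); sizes are arbitrary here.
    k1, k2 = jax.random.split(key)
    return {
        "w1": jax.random.normal(k1, (hidden, 2)) * 0.1,
        "b1": jnp.zeros(hidden),
        "w2": jax.random.normal(k2, (1, hidden)) * 0.1,
        "b2": jnp.zeros(1),
    }

def drift(params, ratio, adv):
    # Non-negative, and both zero and flat at ratio = 1, as a valid
    # Mirror Learning drift must be (guaranteed by squaring the output
    # and multiplying by (ratio - 1)^2).
    x = jnp.stack([ratio - 1.0, adv], axis=-1)
    h = jnp.tanh(x @ params["w1"].T + params["b1"])
    out = (h @ params["w2"].T + params["b2"]).squeeze(-1)
    return (out ** 2) * (ratio - 1.0) ** 2

def surrogate(params, ratio, adv):
    # Mirror-Learning surrogate the inner policy update would ascend:
    # importance-weighted advantage minus the learned drift penalty.
    return jnp.mean(ratio * adv - drift(params, ratio, adv))

# Dummy batch: the gradient w.r.t. the ratio stands in for the direction
# the policy parameters would be pushed during the inner update.
params = init_drift_params(jax.random.PRNGKey(0))
ratio = jnp.array([0.9, 1.0, 1.3])   # pi_new / pi_old
adv = jnp.array([0.5, -1.0, 2.0])    # advantage estimates
policy_direction = jax.grad(surrogate, argnums=1)(params, ratio, adv)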

Cite

Text

Lu et al. "Discovered Policy Optimisation." ICML 2022 Workshops: DARL, 2022.

Markdown

[Lu et al. "Discovered Policy Optimisation." ICML 2022 Workshops: DARL, 2022.](https://mlanthology.org/icmlw/2022/lu2022icmlw-discovered/)

BibTeX

@inproceedings{lu2022icmlw-discovered,
  title     = {{Discovered Policy Optimisation}},
  author    = {Lu, Chris and Kuba, Jakub Grudzien and Letcher, Alistair and Metz, Luke and de Witt, Christian Schroeder and Foerster, Jakob Nicolaus},
  booktitle = {ICML 2022 Workshops: DARL},
  year      = {2022},
  url       = {https://mlanthology.org/icmlw/2022/lu2022icmlw-discovered/}
}