Online Meta-Critic Learning for Off-Policy Actor-Critic Methods
Abstract
Off-Policy Actor-Critic (OffP-AC) methods have proven successful in a variety of continuous control tasks. Normally, the critic's action-value function is updated using temporal-difference, and the critic in turn provides a loss for the actor that trains it to take actions with higher expected return. In this paper, we introduce a flexible and augmented meta-critic that observes the learning process and meta-learns an additional loss for the actor that accelerates and improves actor-critic learning. Compared to existing meta-learning algorithms, the meta-critic is learned rapidly and online for a single task, rather than slowly over a family of tasks. Crucially, our meta-critic is designed for off-policy learners, which currently provide state-of-the-art reinforcement learning sample efficiency. We demonstrate that online meta-critic learning benefits a variety of continuous control tasks when combined with the contemporary OffP-AC methods DDPG, TD3 and SAC.
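The sketch below illustrates the actor update described in the abstract: the actor minimizes the usual critic-derived loss plus an auxiliary loss produced by a meta-critic network. This is a minimal PyTorch sketch under assumed network sizes and module names, not the authors' implementation; in particular, the meta-learning procedure that trains the meta-critic itself online is omitted.

```python
# Minimal sketch of an actor step with a meta-critic auxiliary loss.
# Dimensions, architectures, and names are illustrative assumptions.
import torch
import torch.nn as nn

state_dim, action_dim, hidden = 17, 6, 256  # assumed sizes

actor = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                      nn.Linear(hidden, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                       nn.Linear(hidden, 1))
# Meta-critic: maps (state, action) to a scalar auxiliary loss for the actor.
meta_critic = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                            nn.Linear(hidden, 1))

actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

def actor_update(states):
    """One actor step: critic-derived loss plus meta-learned auxiliary loss."""
    actions = actor(states)
    sa = torch.cat([states, actions], dim=-1)
    main_loss = -critic(sa).mean()      # maximize expected return (DDPG-style)
    aux_loss = meta_critic(sa).mean()   # additional loss supplied by the meta-critic
    loss = main_loss + aux_loss
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()

# Example call on a batch of states sampled from an off-policy replay buffer.
actor_update(torch.randn(128, state_dim))
```

In this sketch only the actor parameters are stepped; the critic is trained by temporal-difference as usual, and the meta-critic would be updated by an online meta-learning step so that its auxiliary loss improves actor learning.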
Cite
Text
Zhou et al. "Online Meta-Critic Learning for Off-Policy Actor-Critic Methods." Neural Information Processing Systems, 2020.

Markdown
[Zhou et al. "Online Meta-Critic Learning for Off-Policy Actor-Critic Methods." Neural Information Processing Systems, 2020.](https://mlanthology.org/neurips/2020/zhou2020neurips-online/)

BibTeX
@inproceedings{zhou2020neurips-online,
  title = {{Online Meta-Critic Learning for Off-Policy Actor-Critic Methods}},
  author = {Zhou, Wei and Li, Yiying and Yang, Yongxin and Wang, Huaimin and Hospedales, Timothy},
  booktitle = {Neural Information Processing Systems},
  year = {2020},
  url = {https://mlanthology.org/neurips/2020/zhou2020neurips-online/}
}