Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics

Abstract

Value-based reinforcement-learning algorithms are currently state-of-the-art in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are currently limited by their need for an on-policy critic, which severely constrains how the critic is learned. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free actor-critic reinforcement-learning algorithm for continuous states and discrete actions, with off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample-efficiency, without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we show approximates Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable and, contrary to other state-of-the-art algorithms, unusually forgiving of poorly configured hyper-parameters. BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, A3C and ACKTR on a variety of tasks. Source code: this https URL.
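
The actor update described in the abstract (slowly imitating the average greedy policy of the critics) can be illustrated with a minimal tabular sketch. This is not the authors' implementation: it assumes a small discrete state space instead of the continuous states the paper targets, uses fixed random Q-tables as stand-ins for trained off-policy critics, and the names `greedy_policy`, `actor_step` and `rate` are hypothetical.

```python
import numpy as np


def greedy_policy(q_values):
    """Return the one-hot greedy policy of a (n_states, n_actions) Q-table."""
    policy = np.zeros_like(q_values)
    best_actions = q_values.argmax(axis=1)
    policy[np.arange(q_values.shape[0]), best_actions] = 1.0
    return policy


def actor_step(actor, critics, rate=0.05):
    """Move the actor a small step toward the critics' average greedy policy.

    `rate` is an assumed imitation rate; small values make the actor change slowly.
    """
    target = np.mean([greedy_policy(q) for q in critics], axis=0)
    return (1.0 - rate) * actor + rate * target


# Assumption: random Q-tables stand in for critics learned from experience replay.
rng = np.random.default_rng(0)
n_states, n_actions, n_critics = 4, 3, 8
critics = [rng.normal(size=(n_states, n_actions)) for _ in range(n_critics)]

# Start from a uniform policy and repeatedly imitate the critics.
actor = np.full((n_states, n_actions), 1.0 / n_actions)
for _ in range(200):
    actor = actor_step(actor, critics)

# Each row of `actor` remains a probability distribution over actions.
print(actor.round(3))
```

In this sketch the target is an average of one-hot policies, so in states where the critics disagree about the best action the actor keeps a more spread-out distribution, and in states where they agree it becomes nearly deterministic. This is one way to read the state-specific exploration mentioned in the abstract.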

Cite

Text

Steckelmacher et al. "Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2019. doi:10.1007/978-3-030-46133-1_2

Markdown

[Steckelmacher et al. "Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2019.](https://mlanthology.org/ecmlpkdd/2019/steckelmacher2019ecmlpkdd-sampleefficient/) doi:10.1007/978-3-030-46133-1_2

BibTeX

@inproceedings{steckelmacher2019ecmlpkdd-sampleefficient,
  title     = {{Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics}},
  author    = {Steckelmacher, Denis and Plisnier, Hélène and Roijers, Diederik M. and Nowé, Ann},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2019},
  pages     = {19--34},
  doi       = {10.1007/978-3-030-46133-1_2},
  url       = {https://mlanthology.org/ecmlpkdd/2019/steckelmacher2019ecmlpkdd-sampleefficient/}
}