Off-Policy Actor-Critic with Shared Experience Replay

Abstract

We investigate the combination of actor-critic reinforcement learning algorithms with uniform large-scale experience replay and propose solutions for two ensuing challenges: (a) efficient actor-critic learning with experience replay, and (b) the stability of off-policy learning where agents learn from other agents' behaviour. To this end, we analyze the bias-variance tradeoffs in V-trace, a form of importance sampling for actor-critic methods. Based on our analysis, we then argue for mixing experience sampled from replay with on-policy experience, and propose a new trust region scheme that scales effectively to data distributions where V-trace becomes unstable. We provide extensive empirical validation of the proposed solutions on DMLab-30 and further show the benefits of this setup in two training regimes for Atari: (1) a single agent is trained for up to 200M environment frames per game, and (2) a population of agents is trained for up to 200M environment frames each and may share experience. We demonstrate state-of-the-art data efficiency among model-free agents in both regimes.
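The abstract refers to V-trace, the clipped importance-sampling correction used to form value targets from off-policy trajectories. As a rough illustration of that mechanism, here is a minimal NumPy sketch of V-trace targets for a single trajectory; the function name, argument layout, and default clipping constants are illustrative assumptions, not the paper's implementation, and per-step termination discounts, the replay/on-policy mixing, and the proposed trust region scheme are omitted.

import numpy as np

def vtrace_targets(behaviour_log_probs, target_log_probs, rewards, values,
                   bootstrap_value, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Sketch of V-trace value targets for one trajectory of length T.

    Hypothetical helper: arrays hold log mu(a_t|x_t), log pi(a_t|x_t),
    rewards r_t, and value estimates V(x_t); bootstrap_value is V(x_T).
    Episode terminations (per-step discounts) are ignored for brevity.
    """
    T = len(rewards)
    rhos = np.exp(target_log_probs - behaviour_log_probs)   # importance ratios pi/mu
    clipped_rhos = np.minimum(rho_bar, rhos)                 # rho_t = min(rho_bar, pi/mu)
    cs = np.minimum(c_bar, rhos)                             # c_t = min(c_bar, pi/mu)

    values_tp1 = np.concatenate([values[1:], [bootstrap_value]])
    deltas = clipped_rhos * (rewards + gamma * values_tp1 - values)  # rho_t * TD error

    # Backward recursion: v_s - V(x_s) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1}))
    corrections = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        acc = deltas[t] + gamma * cs[t] * acc
        corrections[t] = acc
    return values + corrections  # V-trace targets v_s

In this sketch, sampling some trajectories from a replay buffer and the rest from the current policy before calling such a function would correspond to the replay/on-policy mixing the abstract argues for.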

Cite

Text

Schmitt et al. "Off-Policy Actor-Critic with Shared Experience Replay." International Conference on Machine Learning, 2020.

Markdown

[Schmitt et al. "Off-Policy Actor-Critic with Shared Experience Replay." International Conference on Machine Learning, 2020.](https://mlanthology.org/icml/2020/schmitt2020icml-offpolicy/)

BibTeX

@inproceedings{schmitt2020icml-offpolicy,
  title     = {{Off-Policy Actor-Critic with Shared Experience Replay}},
  author    = {Schmitt, Simon and Hessel, Matteo and Simonyan, Karen},
  booktitle = {International Conference on Machine Learning},
  year      = {2020},
  pages     = {8545--8554},
  volume    = {119},
  url       = {https://mlanthology.org/icml/2020/schmitt2020icml-offpolicy/}
}