Safe and Efficient Off-Policy Reinforcement Learning

Abstract

In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(λ), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of "off-policyness"; and (3) it is efficient as it makes the best use of samples collected from near on-policy behaviour policies. We analyse the contractive nature of the related operator under both off-policy policy evaluation and control settings and derive online sample-based algorithms. We believe this is the first return-based off-policy control algorithm converging a.s. to Q* without the GLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary, we prove the convergence of Watkins' Q(λ), which was an open problem since 1989. We illustrate the benefits of Retrace(λ) on a standard suite of Atari 2600 games.
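As a rough illustration of the return-based update described in the abstract, the sketch below computes sample-based Retrace(λ) targets for a single trajectory, using the truncated importance weights c_t = λ min(1, π(a_t|x_t)/μ(a_t|x_t)). This is a minimal NumPy sketch under our own assumptions about array layout; the function name retrace_targets and its interface are illustrative, not code from the paper.

import numpy as np

def retrace_targets(q, pi, mu, actions, rewards, gamma=0.99, lam=1.0):
    """Sketch of Retrace(lambda) targets for one sampled trajectory.

    q       : (T+1, A) current Q-value estimates Q(x_t, .)
    pi      : (T+1, A) target-policy probabilities pi(. | x_t)
    mu      : (T,)     behaviour-policy probabilities mu(a_t | x_t)
    actions : (T,)     actions a_t chosen by the behaviour policy
    rewards : (T,)     rewards r_t
    Returns : (T,)     Retrace targets for Q(x_t, a_t)
    """
    T = len(rewards)
    idx = np.arange(T)

    # Truncated importance weights: c_t = lam * min(1, pi(a_t|x_t) / mu(a_t|x_t)).
    # Truncating at 1 keeps the variance low regardless of how off-policy mu is.
    c = lam * np.minimum(1.0, pi[idx, actions] / mu)

    # One-step TD errors: delta_t = r_t + gamma * E_pi[Q(x_{t+1}, .)] - Q(x_t, a_t)
    exp_q_next = np.sum(pi[1:] * q[1:], axis=1)
    deltas = rewards + gamma * exp_q_next - q[idx, actions]

    # Backward recursion over the corrected return: G_t = delta_t + gamma * c_{t+1} * G_{t+1}
    targets = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        next_c = c[t + 1] if t + 1 < T else 0.0
        acc = deltas[t] + gamma * next_c * acc
        targets[t] = q[t, actions[t]] + acc
    return targets

When μ equals π the weights reduce to λ (recovering an on-policy λ-return), and when μ is far from π the truncation cuts the traces, which is the safety/efficiency trade-off the abstract highlights.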

Cite

Text

Munos et al. "Safe and Efficient Off-Policy Reinforcement Learning." Neural Information Processing Systems, 2016.

Markdown

[Munos et al. "Safe and Efficient Off-Policy Reinforcement Learning." Neural Information Processing Systems, 2016.](https://mlanthology.org/neurips/2016/munos2016neurips-safe/)

BibTeX

@inproceedings{munos2016neurips-safe,
  title     = {{Safe and Efficient Off-Policy Reinforcement Learning}},
  author    = {Munos, R{\'e}mi and Stepleton, Tom and Harutyunyan, Anna and Bellemare, Marc G.},
  booktitle = {Neural Information Processing Systems},
  year      = {2016},
  pages     = {1054--1062},
  url       = {https://mlanthology.org/neurips/2016/munos2016neurips-safe/}
}