An Off-Policy Policy Gradient Theorem Using Emphatic Weightings
Abstract
Policy gradient methods are widely used for control in reinforcement learning, particularly for the continuous action setting. There have been a host of theoretically sound algorithms proposed for the on-policy setting, due to the existence of the policy gradient theorem which provides a simplified form for the gradient. In off-policy learning, however, where the behaviour policy is not necessarily attempting to learn and follow the optimal policy for the given task, the existence of such a theorem has been elusive. In this work, we solve this open problem by providing the first off-policy policy gradient theorem. The key to the derivation is the use of emphatic weightings. We develop a new actor-critic algorithm—called Actor Critic with Emphatic weightings (ACE)—that approximates the simplified gradients provided by the theorem. We demonstrate in a simple counterexample that previous off-policy policy gradient methods—particularly OffPAC and DPG—converge to the wrong solution whereas ACE finds the optimal solution.
Cite
Text
Imani et al. "An Off-Policy Policy Gradient Theorem Using Emphatic Weightings." Neural Information Processing Systems, 2018.
Markdown
[Imani et al. "An Off-Policy Policy Gradient Theorem Using Emphatic Weightings." Neural Information Processing Systems, 2018.](https://mlanthology.org/neurips/2018/imani2018neurips-offpolicy/)
BibTeX
@inproceedings{imani2018neurips-offpolicy,
title = {{An Off-Policy Policy Gradient Theorem Using Emphatic Weightings}},
author = {Imani, Ehsan and Graves, Eric and White, Martha},
booktitle = {Neural Information Processing Systems},
year = {2018},
pages = {96--106},
url = {https://mlanthology.org/neurips/2018/imani2018neurips-offpolicy/}
}