Off-Policy TD(λ) with a True Online Equivalence

van Hasselt, Hado; Mahmood, Ashique Rupam; Sutton, Richard S.

UAI 2014 pp. 330-339

Abstract

Van Seijen and Sutton (2014) recently proposed a new version of the linear TD(λ) learning algorithm that is exactly equivalent to an online forward view and that empirically performed better than its classical counterpart in both prediction and control problems. However, their algorithm is restricted to on-policy learning. In the more general case of off-policy learning, in which the policy whose outcome is predicted and the policy used to generate data may be different, their algorithm cannot be applied. One reason for this is that the algorithm bootstraps and thus is subject to instability problems when function approximation is used. A second reason true online TD(λ) cannot be used for off-policy learning is that the off-policy case requires sophisticated importance sampling in its eligibility traces. To address these limitations, we generalize their equivalence result and use this generalization to construct the first online algorithm to be exactly equivalent to an off-policy forward view. We show that this algorithm, named true online GTD(λ), empirically outperforms GTD(λ) (Maei, 2011), which was derived from the same objective as our forward view but lacks the exact online equivalence. In the general theorem that allows us to derive this new algorithm, we encounter a new general eligibility-trace update.
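
To make the role of importance sampling in eligibility traces concrete, the following is a minimal sketch of one step of classical linear off-policy TD(λ) with an importance-weighted accumulating trace. It is not the true online GTD(λ) algorithm of the paper, and it lacks the gradient correction that GTD(λ) uses for stability. The function name, the argument names, and the default step size, discount, and trace-decay values are assumptions chosen for illustration only.

```python
def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def off_policy_td_lambda_step(theta, trace, phi, phi_next, reward, rho,
                              alpha=0.1, gamma=0.9, lam=0.8):
    """One update of classical linear off-policy TD(lambda).

    theta    : weight vector of the linear value estimate v(s) = theta . phi(s)
    trace    : eligibility trace from the previous step
    phi      : feature vector of the current state
    phi_next : feature vector of the next state
    rho      : importance-sampling ratio pi(a|s) / b(a|s) for the action taken
               (target policy pi, behaviour policy b)

    Illustrative sketch only; hypothetical names, not the paper's algorithm.
    """
    # TD error under the current linear value estimate.
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    # Decay the old trace, add the current features, then reweight by rho.
    trace = [rho * (gamma * lam * e + x) for e, x in zip(trace, phi)]
    # Move the weights along the trace in proportion to the TD error.
    theta = [w + alpha * delta * e for w, e in zip(theta, trace)]
    return theta, trace


# Usage: a single transition with two one-hot features.
theta, trace = off_policy_td_lambda_step(
    theta=[0.0, 0.0], trace=[0.0, 0.0],
    phi=[1.0, 0.0], phi_next=[0.0, 1.0],
    reward=1.0, rho=2.0,
)
```

When `rho` is 0, meaning the target policy would never have taken the action, the trace is reset to zero and the weights are left unchanged. When `rho` is 1 on every step, the update reduces to conventional on-policy TD(λ) with accumulating traces.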

Cite

Text

van Hasselt et al. "Off-Policy TD(λ) with a True Online Equivalence." Conference on Uncertainty in Artificial Intelligence, 2014.

Markdown

[van Hasselt et al. "Off-Policy TD(λ) with a True Online Equivalence." Conference on Uncertainty in Artificial Intelligence, 2014.](https://mlanthology.org/uai/2014/vanhasselt2014uai-off/)

BibTeX

@inproceedings{vanhasselt2014uai-off,
  title     = {{Off-Policy TD(λ) with a True Online Equivalence}},
  author    = {van Hasselt, Hado and Mahmood, Ashique Rupam and Sutton, Richard S.},
  booktitle = {Conference on Uncertainty in Artificial Intelligence},
  year      = {2014},
  pages     = {330--339},
  url       = {https://mlanthology.org/uai/2014/vanhasselt2014uai-off/}
}