Off-Policy Learning with Eligibility Traces: A Survey

Abstract

In the framework of Markov Decision Processes, we consider linear off-policy learning, that is, the problem of learning a linear approximation of the value function of some fixed policy from a single trajectory possibly generated by some other policy. We briefly review the on-policy learning algorithms of the literature (gradient-based and least-squares-based), adopting a unified algorithmic view. Then, we highlight a systematic approach for adapting them to off-policy learning with eligibility traces. This leads to some known algorithms (off-policy LSTD($\lambda$), LSPE($\lambda$), TD($\lambda$), TDC/GQ($\lambda$)) and suggests new extensions (off-policy FPKF($\lambda$), BRM($\lambda$), gBRM($\lambda$), GTD2($\lambda$)). We describe a comprehensive algorithmic derivation of all algorithms in a recursive and memory-efficient form, discuss their known convergence properties, and illustrate their relative empirical behavior on Garnet problems. Our experiments suggest that the most standard algorithms, on- and off-policy LSTD($\lambda$)/LSPE($\lambda$), and TD($\lambda$) when the feature-space dimension is too large for a least-squares approach, perform best.
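For concreteness, the sketch below illustrates the kind of update rule this family of methods shares: linear off-policy TD($\lambda$), where eligibility traces are corrected by per-step importance-sampling ratios between the target and behavior policies. The function names, argument conventions, and the exact placement of the importance ratio are illustrative assumptions for this sketch, not the paper's specific recursions.

```python
import numpy as np

def off_policy_td_lambda(trajectory, phi, pi, mu, n_features,
                         gamma=0.99, lam=0.8, alpha=0.01):
    """Linear off-policy TD(lambda) with importance-sampling corrected
    eligibility traces (illustrative sketch only).

    trajectory : iterable of (s, a, r, s_next) tuples generated by the
                 behavior policy mu.
    phi        : feature map, phi(s) -> np.ndarray of shape (n_features,).
    pi, mu     : target / behavior policies as callables pi(a, s) -> prob.
    """
    theta = np.zeros(n_features)   # weights of the linear value function
    z = np.zeros(n_features)       # eligibility trace

    for (s, a, r, s_next) in trajectory:
        rho = pi(a, s) / mu(a, s)                 # importance ratio
        z = rho * (gamma * lam * z + phi(s))      # corrected trace
        # TD error under the current linear estimate
        delta = r + gamma * theta @ phi(s_next) - theta @ phi(s)
        theta = theta + alpha * delta * z         # stochastic update
    return theta
```

Least-squares variants such as off-policy LSTD($\lambda$) replace this stochastic-gradient step with recursive updates of a feature-by-feature matrix, trading memory and per-step cost for sample efficiency, which is why the abstract singles out TD($\lambda$) when the feature dimension is large.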

Cite

Text

Geist and Scherrer. "Off-Policy Learning with Eligibility Traces: A Survey." Journal of Machine Learning Research, 2014.

Markdown

[Geist and Scherrer. "Off-Policy Learning with Eligibility Traces: A Survey." Journal of Machine Learning Research, 2014.](https://mlanthology.org/jmlr/2014/geist2014jmlr-offpolicy/)

BibTeX

@article{geist2014jmlr-offpolicy,
  title     = {{Off-Policy Learning with Eligibility Traces: A Survey}},
  author    = {Geist, Matthieu and Scherrer, Bruno},
  journal   = {Journal of Machine Learning Research},
  year      = {2014},
  pages     = {289--333},
  volume    = {15},
  url       = {https://mlanthology.org/jmlr/2014/geist2014jmlr-offpolicy/}
}