An Emphatic Approach to the Problem of Off-Policy Temporal-Difference Learning

Abstract

In this paper we introduce the idea of improving the performance of parametric temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing their updates on different time steps. In particular, we show that varying the emphasis of linear TD($\lambda$)'s updates in a particular way causes its expected update to become stable under off-policy training. The only prior model-free TD methods to achieve this with per-step computation linear in the number of function approximation parameters are the gradient-TD family of methods including TDC, GTD($\lambda$), and GQ($\lambda$). Compared to these methods, our emphatic TD($\lambda$) is simpler and easier to use; it has only one learned parameter vector and one step-size parameter. Our treatment includes general state-dependent discounting and bootstrapping functions, and a way of specifying varying degrees of interest in accurately valuing different states.
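
The abstract describes the algorithm only at a high level. As a rough illustration, below is a minimal Python sketch of an emphatic TD($\lambda$) update loop with linear function approximation, following the standard presentation of the method: a followon trace F accumulates interest, an emphasis M scales each step's eligibility trace, and importance-sampling ratios correct for off-policy action selection. The transition format, variable names, and the convention of passing feature vectors to the interest, discounting, and bootstrapping functions are illustrative assumptions, not the paper's notation.

import numpy as np

def emphatic_td_lambda(transitions, n_features, alpha=0.01,
                       interest=lambda x: 1.0,
                       gamma=lambda x: 0.99,
                       lam=lambda x: 0.9):
    """Sketch of one pass of emphatic TD(lambda) over off-policy transitions
    with linear function approximation.

    Each transition is (x, rho, r, x_next), where
      x      -- feature vector of the current state S_t
      rho    -- importance-sampling ratio pi(A_t | S_t) / mu(A_t | S_t)
      r      -- reward R_{t+1}
      x_next -- feature vector of the next state S_{t+1}
    interest, gamma, lam are the state-dependent interest, discounting, and
    bootstrapping functions, here treated as functions of the feature vector.
    """
    theta = np.zeros(n_features)   # the single learned weight vector
    e = np.zeros(n_features)       # eligibility trace
    F = 0.0                        # followon trace
    for x, rho, r, x_next in transitions:
        i, g, l = interest(x), gamma(x), lam(x)
        F = g * F + i                      # followon: F_t = gamma_t rho_{t-1} F_{t-1} + I_t
        M = l * i + (1 - l) * F            # emphasis for this step
        delta = r + gamma(x_next) * theta @ x_next - theta @ x
        e = rho * (g * l * e + M * x)      # emphasis- and rho-weighted trace
        theta += alpha * delta * e         # single step-size parameter alpha
        F *= rho                           # carry rho_t into the next followon update
    return theta

Note how the sketch mirrors the claims above: there is only one learned parameter vector (theta) and one step-size parameter (alpha), and the emphasis M is what distinguishes the update from ordinary off-policy TD($\lambda$).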

Cite

Text

Sutton et al. "An Emphatic Approach to the Problem of Off-Policy Temporal-Difference Learning." Journal of Machine Learning Research, 2016.

Markdown

[Sutton et al. "An Emphatic Approach to the Problem of Off-Policy Temporal-Difference Learning." Journal of Machine Learning Research, 2016.](https://mlanthology.org/jmlr/2016/sutton2016jmlr-emphatic/)

BibTeX

@article{sutton2016jmlr-emphatic,
  title     = {{An Emphatic Approach to the Problem of Off-Policy Temporal-Difference Learning}},
  author    = {Sutton, Richard S. and Mahmood, A. Rupam and White, Martha},
  journal   = {Journal of Machine Learning Research},
  year      = {2016},
  pages     = {1--29},
  volume    = {17},
  url       = {https://mlanthology.org/jmlr/2016/sutton2016jmlr-emphatic/}
}