An Emphatic Approach to the Problem of Off-Policy Temporal-Difference Learning
Abstract
In this paper we introduce the idea of improving the performance of parametric temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing their updates on different time steps. In particular, we show that varying the emphasis of linear TD($\lambda$)'s updates in a particular way causes its expected update to become stable under off-policy training. The only prior model-free TD methods to achieve this with per-step computation linear in the number of function approximation parameters are the gradient-TD family of methods including TDC, GTD($\lambda$), and GQ($\lambda$). Compared to these methods, our emphatic TD($\lambda$) is simpler and easier to use; it has only one learned parameter vector and one step-size parameter. Our treatment includes general state-dependent discounting and bootstrapping functions, and a way of specifying varying degrees of interest in accurately valuing different states.
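The abstract's claim of a single learned weight vector and a single step-size parameter can be illustrated with a sketch of one update step. The snippet below is a minimal Python rendering of the emphatic TD($\lambda$) update with linear function approximation, general interest, discounting, and bootstrapping, following the update equations published in the paper; the function name, argument names, and the convention of passing per-step quantities explicitly are illustrative choices, not something given on this page.

```python
import numpy as np

def emphatic_td_step(theta, e, F, rho_prev, rho, interest,
                     gamma, gamma_next, lam, alpha,
                     phi, phi_next, reward):
    """One step of emphatic TD(lambda) with linear function approximation.

    theta      -- the single learned weight vector
    e, F       -- eligibility trace and follow-on trace from the previous step
    rho_prev   -- importance-sampling ratio of the previous action
    rho        -- importance-sampling ratio pi(A_t|S_t) / mu(A_t|S_t)
    interest   -- interest i(S_t) in accurately valuing the current state
    gamma      -- state-dependent discount gamma(S_t)
    gamma_next -- discount gamma(S_{t+1}) applied to the next state's value
    lam        -- state-dependent bootstrapping parameter lambda(S_t)
    alpha      -- the single step-size parameter
    phi, phi_next -- feature vectors of the current and next states
    reward     -- reward R_{t+1}
    """
    # Follow-on trace: discounted, importance-weighted accumulation of interest.
    F = rho_prev * gamma * F + interest
    # Emphasis: how strongly this step's update is weighted.
    M = lam * interest + (1.0 - lam) * F
    # Eligibility trace, scaled by the emphasis and the current ratio.
    e = rho * (gamma * lam * e + M * phi)
    # Ordinary TD error under linear function approximation.
    delta = reward + gamma_next * theta.dot(phi_next) - theta.dot(phi)
    # Emphatic TD(lambda) update of the single weight vector.
    theta = theta + alpha * delta * e
    return theta, e, F
```

In this sketch, an agent would carry `theta`, `e`, `F`, and the previous step's ratio across time steps, starting with `e = 0`, `F = 0`, and `rho_prev = 1` so that the first call sets the follow-on trace to the initial state's interest.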
Cite
Text
Sutton et al. "An Emphatic Approach to the Problem of Off-Policy Temporal-Difference Learning." Journal of Machine Learning Research, 2016.

Markdown
[Sutton et al. "An Emphatic Approach to the Problem of Off-Policy Temporal-Difference Learning." Journal of Machine Learning Research, 2016.](https://mlanthology.org/jmlr/2016/sutton2016jmlr-emphatic/)

BibTeX
@article{sutton2016jmlr-emphatic,
title = {{An Emphatic Approach to the Problem of Off-Policy Temporal-Difference Learning}},
author = {Sutton, Richard S. and Mahmood, A. Rupam and White, Martha},
journal = {Journal of Machine Learning Research},
year = {2016},
pages = {1--29},
volume = {17},
url = {https://mlanthology.org/jmlr/2016/sutton2016jmlr-emphatic/}
}