Off-Policy Temporal Difference Learning with Function Approximation

Abstract

We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms the basis for popular reinforcement learning methods such as Q-learning, which has been known to diverge with linear function approximation, and because it is critical to the practical utility of multi-scale, multi-goal learning frameworks such as options, HAMs, and MAXQ. Our new algorithm combines TD(λ) over state–action pairs with importance sampling ideas from our previous work. We prove that, given training under any ε-soft policy, the algorithm converges with probability 1 to a close approximation (as in Tsitsiklis and Van Roy, 1997; Tadic, 2001) to the action-value function for an arbitrary target policy. Variations of the algorithm designed to reduce variance introduce additional bias but are also guaranteed convergent. We also illustrate our method empirically on a small policy evaluation problem. Our current results are limited to episodic tasks with episodes of bounded length.

Although Q-learning remains the most popular of all reinforcement learning algorithms, it has been known since about 1996 that it is unsound with linear function approximation (see Gordon, 1995; Bertsekas and Tsitsiklis, 1996). The most telling counterexample, due to Baird (1995), is a seven-state Markov decision process with linearly independent feature vectors, for which an exact solution exists, yet

1 This is a re-typeset version of an article published in the Proceedings
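
The importance sampling idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it only computes the cumulative ratios π(a|s)/b(a|s) along a short trajectory, assuming an ε-greedy behavior policy (one example of an ε-soft policy) and a greedy target policy. The function names and the action values are hypothetical, chosen for illustration only.

```python
from itertools import accumulate
from operator import mul


def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy.

    Every action gets at least epsilon / n probability, so the policy
    is epsilon-soft and never assigns zero probability to an action.
    """
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    return [epsilon / n + (1.0 - epsilon if a == greedy else 0.0)
            for a in range(n)]


def greedy_probs(q_values):
    """Action probabilities of a deterministic greedy policy."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    return [1.0 if a == greedy else 0.0 for a in range(n)]


def cumulative_ratios(target_probs, behavior_probs, actions):
    """Running products of pi(a_t|s_t) / b(a_t|s_t) along a trajectory.

    target_probs and behavior_probs hold one probability list per time
    step; actions holds the index of the action taken at each step.
    """
    per_step = [pi[a] / b[a]
                for pi, b, a in zip(target_probs, behavior_probs, actions)]
    return list(accumulate(per_step, mul))


# Made-up action values for the two states visited in a two-step episode.
q_per_step = [[1.0, 2.0, 0.5], [0.3, 0.1, 0.2]]
behavior = [epsilon_greedy_probs(q, 0.1) for q in q_per_step]
target = [greedy_probs(q) for q in q_per_step]

print(cumulative_ratios(target, behavior, [1, 0]))  # both actions greedy
print(cumulative_ratios(target, behavior, [1, 2]))  # second action not greedy
```

Because the behavior policy is ε-soft, each denominator is at least ε/n, so every per-step ratio is finite and bounded by n/ε. In the second trajectory the final cumulative ratio is zero, since the greedy target policy would never have taken the second action.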

Cite

Text

Precup et al. "Off-Policy Temporal Difference Learning with Function Approximation." International Conference on Machine Learning, 2001.

Markdown

[Precup et al. "Off-Policy Temporal Difference Learning with Function Approximation." International Conference on Machine Learning, 2001.](https://mlanthology.org/icml/2001/precup2001icml-off/)

BibTeX

@inproceedings{precup2001icml-off,
  title     = {{Off-Policy Temporal Difference Learning with Function Approximation}},
  author    = {Precup, Doina and Sutton, Richard S. and Dasgupta, Sanjoy},
  booktitle = {International Conference on Machine Learning},
  year      = {2001},
  pages     = {417-424},
  url       = {https://mlanthology.org/icml/2001/precup2001icml-off/}
}