Technical Update: Least-Squares Temporal Difference Learning

Abstract

TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it may make inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto (1996, Machine Learning, 22:1–3, 33–57) eliminates all stepsize parameters and improves data efficiency. This paper updates Bradtke and Barto's work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting new algorithm is shown to be a practical, incremental formulation of supervised linear regression. Third, it presents a novel and intuitive interpretation of LSTD as a model-based reinforcement learning technique.
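As a rough illustration of the idea, the standard LSTD(λ) recipe accumulates a matrix A and vector b from observed transitions via an eligibility trace, then solves A w = b for the linear value-function weights in one shot, with no stepsize. The sketch below is a minimal NumPy rendering of that recipe, not the paper's own code; the function name, the transition-tuple layout, and the small ridge term added for invertibility are all assumptions of this example.

```python
import numpy as np

def lstd_lambda(transitions, n_features, gamma=0.9, lam=0.0, reg=1e-6):
    """Sketch of LSTD(lambda): solve A w = b for linear value weights w.

    transitions: iterable of (phi, reward, phi_next, done) tuples, where
    phi and phi_next are feature vectors of the current and next state.
    reg: small ridge term to keep A invertible (an assumption of this sketch).
    """
    A = reg * np.eye(n_features)
    b = np.zeros(n_features)
    z = np.zeros(n_features)  # eligibility trace
    for phi, r, phi_next, done in transitions:
        z = gamma * lam * z + phi                # decay trace, add current features
        # terminal next-state contributes zero value
        A += np.outer(z, phi - (0.0 if done else gamma) * phi_next)
        b += z * r
        if done:
            z = np.zeros(n_features)             # reset trace between episodes
    return np.linalg.solve(A, b)
```

With λ = 0 this reduces to Bradtke and Barto's original LSTD, and with λ = 1 the accumulated statistics correspond to supervised regression on Monte Carlo returns, matching the paper's characterization of the two extremes.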

Cite

Text

Boyan. "Technical Update: Least-Squares Temporal Difference Learning." Machine Learning, 2002. doi:10.1023/A:1017936530646

Markdown

[Boyan. "Technical Update: Least-Squares Temporal Difference Learning." Machine Learning, 2002.](https://mlanthology.org/mlj/2002/boyan2002mlj-technical/) doi:10.1023/A:1017936530646

BibTeX

@article{boyan2002mlj-technical,
  title     = {{Technical Update: Least-Squares Temporal Difference Learning}},
  author    = {Boyan, Justin A.},
  journal   = {Machine Learning},
  year      = {2002},
  pages     = {233--246},
  doi       = {10.1023/A:1017936530646},
  volume    = {49},
  url       = {https://mlanthology.org/mlj/2002/boyan2002mlj-technical/}
}