A Counterexample to Temporal Differences Learning

Abstract

Sutton's TD(λ) method aims to provide a representation of the cost function in an absorbing Markov chain with transition costs. A simple example is given where the representation obtained depends on λ. For λ = 1 the representation is optimal with respect to a least-squares error criterion, but as λ decreases toward 0 the representation becomes progressively worse and, in some cases, very poor. The example suggests a need to understand better the circumstances under which TD(0) and Q-learning obtain satisfactory neural network-based compact representations of the cost function. A variation of TD(0) is also given, which performs better on the example.
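The λ-dependence described in the abstract can be seen even without simulation, since for linear function approximation the TD(λ) limit solves a λ-dependent projected fixed-point equation. The sketch below is a generic illustration of that phenomenon, not the paper's specific counterexample: it uses an assumed 5-state deterministic absorbing chain with unit transition costs, a single illustrative feature φ(i) = i + 1, and visitation weights from uniform starting states, then solves the fixed-point equation Φᵀ D (I − λP)⁻¹[(P − I)Φw + c] = 0 for each λ.

```python
import numpy as np

# Hypothetical 5-state absorbing chain (NOT the example from the paper):
# state i moves deterministically to i+1 at cost 1; state 5 is absorbing,
# so the true cost-to-go is J(i) = 5 - i. It is approximated by a single
# illustrative linear feature phi(i) = i + 1 with scalar weight w.
N = 5
P = np.zeros((N, N))                    # substochastic matrix over transient states
for i in range(N - 1):
    P[i, i + 1] = 1.0                   # step right; last row zero = absorption
c = np.ones(N)                          # cost 1 per transition
phi = np.arange(1, N + 1, dtype=float)  # feature vector phi(i) = i + 1
d = np.arange(1, N + 1, dtype=float)    # visit counts under uniform starts
D = np.diag(d / d.sum())                # diagonal weighting matrix

def td_fixed_point(lam):
    """Solve phi' D (I - lam*P)^-1 [(P - I) phi w + c] = 0 for scalar w."""
    M = np.linalg.inv(np.eye(N) - lam * P)
    A = phi @ D @ M @ ((P - np.eye(N)) @ phi)
    b = phi @ D @ M @ c
    return -b / A

w0 = td_fixed_point(0.0)                # TD(0) limit
w1 = td_fixed_point(1.0)                # TD(1) limit

# TD(1) recovers the D-weighted least-squares fit of the true cost-to-go,
# consistent with the abstract's claim that lambda = 1 is optimal in the
# least-squares sense; TD(0) lands on a different weight.
J = np.linalg.inv(np.eye(N) - P) @ c    # true cost-to-go: J(i) = N - i
w_ls = (phi @ D @ J) / (phi @ D @ phi)  # weighted least-squares weight
```

On this toy chain the two limits differ (w0 = 11/19 ≈ 0.579 versus w1 = 7/15 ≈ 0.467), and w1 coincides with the least-squares fit, mirroring the qualitative behavior the abstract reports.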

Cite

Text

Bertsekas. "A Counterexample to Temporal Differences Learning." Neural Computation, 1995. doi:10.1162/NECO.1995.7.2.270

Markdown

[Bertsekas. "A Counterexample to Temporal Differences Learning." Neural Computation, 1995.](https://mlanthology.org/neco/1995/bertsekas1995neco-counterexample/) doi:10.1162/NECO.1995.7.2.270

BibTeX

@article{bertsekas1995neco-counterexample,
  title     = {{A Counterexample to Temporal Differences Learning}},
  author    = {Bertsekas, Dimitri P.},
  journal   = {Neural Computation},
  year      = {1995},
  pages     = {270--279},
  doi       = {10.1162/NECO.1995.7.2.270},
  volume    = {7},
  url       = {https://mlanthology.org/neco/1995/bertsekas1995neco-counterexample/}
}