Analysis of Temporal-Difference Learning with Function Approximation

Abstract

We present new results about the temporal-difference learning algorithm, as applied to approximating the cost-to-go function of a Markov chain using linear function approximators. The algorithm we analyze performs on-line updating of a parameter vector during a single endless trajectory of an aperiodic irreducible finite state Markov chain. Results include convergence (with probability 1), a characterization of the limit of convergence, and a bound on the resulting approximation error. In addition to establishing new and stronger results than those previously available, our analysis is based on a new line of reasoning that provides new intuition about the dynamics of temporal-difference learning. Furthermore, we discuss the implications of two counter-examples with regards to the significance of on-line updating and linearly parameterized function approximators.
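
A minimal, illustrative sketch of the algorithm the abstract describes: TD(lambda) with a linear function approximator, updated on-line along a single trajectory of a finite Markov chain. The transition matrix, features, one-stage costs, discount factor, lambda, and step-size schedule below are all assumptions made for the example, not values from the paper.

import numpy as np

rng = np.random.default_rng(0)

n_states = 5        # finite state space
k = 2               # number of features in the linear architecture
gamma = 0.9         # discount factor (assumed)
lam = 0.7           # the lambda of TD(lambda) (assumed)

# Random transition matrix with strictly positive rows summing to 1
# (Dirichlet rows), so the chain is irreducible and aperiodic.
P = rng.dirichlet(np.ones(n_states), size=n_states)
g = rng.standard_normal(n_states)          # one-stage costs (assumed)
Phi = rng.standard_normal((n_states, k))   # feature matrix; row s is phi(s)

theta = np.zeros(k)   # parameter vector, updated on-line
z = np.zeros(k)       # eligibility trace

s = 0
for t in range(200000):
    s_next = rng.choice(n_states, p=P[s])
    # Temporal-difference error under the approximation J(s) ~ phi(s)' theta.
    d = g[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    z = gamma * lam * z + Phi[s]            # decay and accumulate the trace
    theta += (10.0 / (1000.0 + t)) * d * z  # diminishing step size
    s = s_next

print("learned parameter vector:", theta)

With diminishing step sizes of this kind, the paper's results say the iterates converge with probability 1, characterize the limit, and bound the distance between the limiting approximation and the true cost-to-go function.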

Cite

Text

Tsitsiklis and Van Roy. "Analysis of Temporal-Difference Learning with Function Approximation." Neural Information Processing Systems, 1996.

Markdown

[Tsitsiklis and Van Roy. "Analysis of Temporal-Difference Learning with Function Approximation." Neural Information Processing Systems, 1996.](https://mlanthology.org/neurips/1996/tsitsiklis1996neurips-analysis/)

BibTeX

@inproceedings{tsitsiklis1996neurips-analysis,
  title     = {{Analysis of Temporal-Difference Learning with Function Approximation}},
  author    = {Tsitsiklis, John N. and Van Roy, Benjamin},
  booktitle = {Neural Information Processing Systems},
  year      = {1996},
  pages     = {1075-1081},
  url       = {https://mlanthology.org/neurips/1996/tsitsiklis1996neurips-analysis/}
}