Analysis of Temporal-Difference Learning with Function Approximation
Abstract
We present new results about the temporal-difference learning algorithm, as applied to approximating the cost-to-go function of a Markov chain using linear function approximators. The algorithm we analyze performs on-line updating of a parameter vector during a single endless trajectory of an aperiodic irreducible finite-state Markov chain. Results include convergence (with probability 1), a characterization of the limit of convergence, and a bound on the resulting approximation error. In addition to establishing new and stronger results than those previously available, our analysis is based on a new line of reasoning that provides new intuition about the dynamics of temporal-difference learning. Furthermore, we discuss the implications of two counter-examples with regard to the significance of on-line updating and linearly parameterized function approximators.
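The algorithm in question is TD(λ) with a linear approximator J̃(x) = φ(x)ᵀθ, updated on-line as the chain is simulated. As a rough illustration only, here is a minimal Python/NumPy sketch of that update along a single trajectory; the randomly generated chain, cost function, feature matrix, and step-size schedule are all illustrative placeholders and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (not from the paper): a small Markov chain whose
# transition matrix has all-positive entries, hence it is aperiodic
# and irreducible, with a cost attached to each transition.
n_states, n_features = 5, 3
P = rng.dirichlet(np.ones(n_states), size=n_states)  # row-stochastic transition matrix
g = rng.standard_normal((n_states, n_states))        # cost of each transition
Phi = rng.standard_normal((n_states, n_features))    # row x is the feature vector phi(x)

gamma, lam = 0.95, 0.7        # discount factor and TD(lambda) parameter
theta = np.zeros(n_features)  # parameter vector for J_tilde(x) = phi(x) @ theta
z = np.zeros(n_features)      # eligibility trace

x = 0
for t in range(100_000):
    x_next = rng.choice(n_states, p=P[x])
    # Temporal difference: observed cost plus discounted next-state
    # estimate, minus the current estimate.
    d = g[x, x_next] + gamma * Phi[x_next] @ theta - Phi[x] @ theta
    # Eligibility trace: discounted accumulation of past feature vectors.
    z = gamma * lam * z + Phi[x]
    # On-line update with a diminishing step size (chosen so that the
    # usual stochastic-approximation conditions hold).
    alpha = 1.0 / (1 + t / 1000)
    theta += alpha * d * z
    x = x_next

print("learned parameters:", theta)
```

Under the paper's assumptions this iteration converges with probability 1, and the limit solves a projected fixed-point equation whose approximation error is bounded relative to the best linear fit of the true cost-to-go function.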
Cite

Text

Tsitsiklis and Van Roy. "Analysis of Temporal-Difference Learning with Function Approximation." Neural Information Processing Systems, 1996.

Markdown

[Tsitsiklis and Van Roy. "Analysis of Temporal-Difference Learning with Function Approximation." Neural Information Processing Systems, 1996.](https://mlanthology.org/neurips/1996/tsitsiklis1996neurips-analysis/)

BibTeX
@inproceedings{tsitsiklis1996neurips-analysis,
title = {{Analysis of Temporal-Difference Learning with Function Approximation}},
author = {Tsitsiklis, John N. and Van Roy, Benjamin},
booktitle = {Neural Information Processing Systems},
year = {1996},
pages = {1075-1081},
url = {https://mlanthology.org/neurips/1996/tsitsiklis1996neurips-analysis/}
}