Dynamics of Temporal Difference Learning

Abstract

In the behavioural sciences, one considers the problem of a sequence of stimuli followed by a sequence of rewards r(t). The subject's task is to learn the full sequence of rewards from the stimuli, where the prediction is modelled by the Sutton-Barto rule. Over a sequence of n trials, this prediction rule is learned iteratively by temporal difference learning. We present a closed formula for the prediction of rewards at trial time t within trial n. From that formula, we show directly that as n tends to infinity, the predictions converge to the true rewards. In this approach, a new property of correlation-type Toeplitz matrices is proven. We give learning rates which optimally speed up the learning process.
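The iterative scheme the abstract describes can be illustrated with a minimal sketch of linear temporal difference learning of reward predictions over repeated trials. The stimulus encoding (one indicator stimulus per time step), the reward sequence, the constant learning rate, and the trial count below are illustrative assumptions, not the paper's exact setup or its closed-form result.

```python
import numpy as np

T = 5                                    # trial length (illustrative)
X = np.eye(T)                            # one indicator stimulus per time step (assumption)
r = np.array([0.0, 0.0, 1.0, 0.0, 2.0])  # rewards following the stimuli (assumption)
true_returns = np.cumsum(r[::-1])[::-1]  # target: sum of rewards from t onward

w = np.zeros(T)                          # weights of the linear predictor
alpha = 0.1                              # constant learning rate (assumption)

for trial in range(2000):                # repeated trials, as in the paper's setting
    for t in range(T):
        v_t = w @ X[t]
        v_next = w @ X[t + 1] if t + 1 < T else 0.0
        # TD update: move the prediction toward reward plus next prediction
        w += alpha * (r[t] + v_next - v_t) * X[t]

print(np.round(w, 3))                    # predictions approach the true future-reward sums
```

With these toy values the learned weights approach the true cumulative future rewards, matching the convergence behaviour the abstract states for n to infinity.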

Cite

Text

Andreas Wendemuth. "Dynamics of Temporal Difference Learning." International Joint Conference on Artificial Intelligence, 2007.

Markdown

[Andreas Wendemuth. "Dynamics of Temporal Difference Learning." International Joint Conference on Artificial Intelligence, 2007.](https://mlanthology.org/ijcai/2007/wendemuth2007ijcai-dynamics/)

BibTeX

@inproceedings{wendemuth2007ijcai-dynamics,
  title     = {{Dynamics of Temporal Difference Learning}},
  author    = {Wendemuth, Andreas},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2007},
  pages     = {1107--1112},
  url       = {https://mlanthology.org/ijcai/2007/wendemuth2007ijcai-dynamics/}
}