On the Convergence of Stochastic Iterative Dynamic Programming Algorithms

Abstract

Recent developments in the area of reinforcement learning have yielded a number of new algorithms for the prediction and control of Markovian environments. These algorithms, including the TD(λ) algorithm of Sutton (1988) and the Q-learning algorithm of Watkins (1989), can be motivated heuristically as approximations to dynamic programming (DP). In this paper we provide a rigorous proof of convergence of these DP-based learning algorithms by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem. The theorem establishes a general class of convergent algorithms to which both TD(λ) and Q-learning belong.
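To make the connection concrete, below is a minimal sketch of the Q-learning update the paper analyzes, applied to a tiny toy MDP of my own construction (the two-state chain, rewards, and discount factor are illustrative assumptions, not from the paper). The step sizes decay polynomially per state-action pair, one standard way to satisfy the stochastic-approximation conditions (sum of steps infinite, sum of squares finite) that the convergence theorem relies on.

```python
import random

# Toy deterministic MDP (an illustrative assumption, not from the paper):
# states {0, 1}; taking action a moves the agent to state a;
# reward is 1 for action 1 and 0 for action 0.
GAMMA = 0.9
STATES, ACTIONS = (0, 1), (0, 1)

def step(s, a):
    """Return (next_state, reward) for this toy environment."""
    return a, (1.0 if a == 1 else 0.0)

def q_learning(steps=50000, seed=0):
    """Q-learning as a stochastic iterative DP algorithm:
    Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma * max_b Q(s',b)],
    with per-pair step sizes alpha_n = n^{-0.6}, a Robbins-Monro
    schedule (sum alpha_n = inf, sum alpha_n^2 < inf)."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    visits = {(s, a): 0 for s in STATES for a in ACTIONS}
    s = 0
    for _ in range(steps):
        a = rng.choice(ACTIONS)                 # uniform exploration
        s_next, r = step(s, a)
        visits[(s, a)] += 1
        alpha = visits[(s, a)] ** -0.6
        target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next
    return Q

Q = q_learning()
# For this toy MDP the DP fixed point is V*(s) = 10 for both states,
# so Q*(s, 1) = 1 + 0.9 * 10 = 10 and Q*(s, 0) = 0 + 0.9 * 10 = 9;
# the learned Q should be close to these values.
```

The environment here is deterministic, so convergence is easy to see numerically; the paper's contribution is showing that the same averaging scheme converges even when rewards and transitions are stochastic, by casting the update as a stochastic approximation to the DP contraction.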

Cite

Text

Jaakkola et al. "On the Convergence of Stochastic Iterative Dynamic Programming Algorithms." Neural Computation, 1994. doi:10.1162/NECO.1994.6.6.1185

Markdown

[Jaakkola et al. "On the Convergence of Stochastic Iterative Dynamic Programming Algorithms." Neural Computation, 1994.](https://mlanthology.org/neco/1994/jaakkola1994neco-convergence/) doi:10.1162/NECO.1994.6.6.1185

BibTeX

@article{jaakkola1994neco-convergence,
  title     = {{On the Convergence of Stochastic Iterative Dynamic Programming Algorithms}},
  author    = {Jaakkola, Tommi S. and Jordan, Michael I. and Singh, Satinder P.},
  journal   = {Neural Computation},
  year      = {1994},
  pages     = {1185--1201},
  doi       = {10.1162/NECO.1994.6.6.1185},
  volume    = {6},
  number    = {6},
  url       = {https://mlanthology.org/neco/1994/jaakkola1994neco-convergence/}
}