Learning from Delayed Rewards

Abstract

The thesis introduces the notion of reinforcement learning as learning to control a Markov Decision Process by incremental dynamic programming, and describes a range of algorithms for doing this, including Q-learning, for which a sketch of a proof of convergence is given.
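As a rough illustration of what incremental dynamic programming on a Markov Decision Process can look like, the following is a minimal sketch of the commonly stated one-step tabular Q-learning update. It is not taken from the thesis text. The function names (`q_learning`, `chain_step`), the four-state chain environment, and all parameter values are assumptions chosen only for this example.

```python
import random


def q_learning(n_states, n_actions, step, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """One-step tabular Q-learning (illustrative sketch, hypothetical API).

    `step(state, action)` must return (next_state, reward, done).
    Every episode starts in state 0 (an assumption of this sketch).
    """
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                # Explore: pick a uniformly random action.
                action = rng.randrange(n_actions)
            else:
                # Exploit: pick a greedy action, breaking ties at random.
                best = max(q[state])
                action = rng.choice(
                    [a for a in range(n_actions) if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Bootstrapped target: immediate reward plus the discounted
            # value of the best action in the next state (0 if terminal).
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def chain_step(state, action):
    """Hypothetical 4-state chain: action 1 moves right, action 0 moves left.

    Reaching state 3 yields reward 1 and ends the episode; the reward is
    delayed in the sense that all earlier steps return 0.
    """
    next_state = min(state + 1, 3) if action == 1 else max(state - 1, 0)
    done = next_state == 3
    return next_state, (1.0 if done else 0.0), done


q_table = q_learning(n_states=4, n_actions=2, step=chain_step)
greedy_policy = [max(range(2), key=lambda a: q_table[s][a]) for s in range(3)]
```

In this toy setting the only reward arrives at the end of the chain, so the learned values for earlier states come entirely from the bootstrapped target, and `greedy_policy` should prefer moving right in every non-terminal state.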

Cite

Text

Watkins. "Learning from Delayed Rewards." PhD thesis, University of Cambridge, 1989.

Markdown

[Watkins. "Learning from Delayed Rewards." PhD thesis, University of Cambridge, 1989.](https://mlanthology.org/misc/1989/watkins1989misc-learning/)

BibTeX

@misc{watkins1989misc-learning,
  title     = {{Learning from Delayed Rewards}},
  author    = {Watkins, Christopher J. C. H.},
  howpublished = {PhD thesis, University of Cambridge},
  year      = {1989},
  url       = {https://mlanthology.org/misc/1989/watkins1989misc-learning/}
}