Learning from Delayed Rewards
Abstract
The thesis introduces the notion of reinforcement learning as learning to control a Markov Decision Process by incremental dynamic programming, and describes a range of algorithms for doing this, including Q-learning, for which a sketch of a proof of convergence is given.
Cite
Text
Watkins. "Learning from Delayed Rewards." PhD thesis, University of Cambridge, 1989.
Markdown
[Watkins. "Learning from Delayed Rewards." PhD thesis, University of Cambridge, 1989.](https://mlanthology.org/misc/1989/watkins1989misc-learning/)
BibTeX
@misc{watkins1989misc-learning,
title = {{Learning from Delayed Rewards}},
author = {Watkins, Christopher J. C. H.},
howpublished = {PhD thesis, University of Cambridge},
year = {1989},
url = {https://mlanthology.org/misc/1989/watkins1989misc-learning/}
}