Convergence of Least Squares Temporal Difference Methods Under General Conditions
Abstract
We consider approximate policy evaluation for finite state and action Markov decision processes (MDPs) in the off-policy learning context and with the simulation-based least squares temporal difference algorithm, LSTD($\lambda$). We establish for the discounted cost criterion that off-policy LSTD($\lambda$) converges almost surely under mild, minimal conditions. We also analyze other convergence and boundedness properties of the iterates involved in the algorithm and, based on them, suggest a modification in its practical implementation. Our analysis uses theories of both finite-space Markov chains and Markov chains on topological spaces.
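As a concrete illustration of the iterates the analysis concerns, the following is a minimal sketch of off-policy LSTD($\lambda$) with importance-sampling corrections. It assumes linear features `phi`, tabular target and behavior policies `pi` and `mu` given as arrays of action probabilities, and a trajectory of transitions `(s, a, c, s_next)` with per-stage costs `c` generated under `mu`; the names and the trajectory interface are illustrative assumptions of this sketch, not notation from the paper.

```python
# Minimal sketch of off-policy LSTD(lambda) with importance-sampling
# ratios, under the assumptions stated above (illustrative only).
import numpy as np

def off_policy_lstd(trajectory, phi, pi, mu, gamma, lam):
    """trajectory: list of (s, a, c, s_next) transitions collected
    under the behavior policy mu; phi: state -> feature vector;
    pi, mu: arrays with pi[s, a] = target / behavior action probability."""
    d = phi(trajectory[0][0]).shape[0]
    A = np.zeros((d, d))   # estimate of the LSTD matrix
    b = np.zeros(d)        # estimate of the LSTD vector
    z = np.zeros(d)        # eligibility trace
    rho_prev = 1.0         # no correction carried in before the first step
    for (s, a, c, s_next) in trajectory:
        rho = pi[s, a] / mu[s, a]           # importance-sampling ratio
        # One common form of the off-policy trace recursion:
        # z_t = gamma * lam * rho_{t-1} * z_{t-1} + phi(s_t)
        z = gamma * lam * rho_prev * z + phi(s)
        # rho corrects the next-state feature and the observed cost,
        # both of which depend on the action drawn from mu.
        A += np.outer(z, phi(s) - gamma * rho * phi(s_next))
        b += rho * c * z
        rho_prev = rho
    T = len(trajectory)
    # Least-squares solve of (A/T) theta = (b/T); Phi @ theta then
    # approximates the discounted cost of the target policy.
    theta, *_ = np.linalg.lstsq(A / T, b / T, rcond=None)
    return theta
```

Solving the final system with `lstsq` rather than a plain matrix inverse is a simple safeguard for early iterations, where the matrix estimate can be singular or poorly conditioned; the boundedness issues that motivate such safeguards, and the paper's own suggested implementation modification, are treated in the paper itself.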
Cite
Text
Yu. "Convergence of Least Squares Temporal Difference Methods Under General Conditions." International Conference on Machine Learning, 2010.Markdown
[Yu. "Convergence of Least Squares Temporal Difference Methods Under General Conditions." International Conference on Machine Learning, 2010.](https://mlanthology.org/icml/2010/yu2010icml-convergence/)BibTeX
@inproceedings{yu2010icml-convergence,
title = {{Convergence of Least Squares Temporal Difference Methods Under General Conditions}},
author = {Yu, Huizhen},
booktitle = {International Conference on Machine Learning},
year = {2010},
pages = {1207--1214},
url = {https://mlanthology.org/icml/2010/yu2010icml-convergence/}
}