A Finite Time Analysis of Temporal Difference Learning with Linear Function Approximation
Abstract
Temporal difference learning (TD) is a simple iterative algorithm widely used for policy evaluation in Markov reward processes. Bhandari et al. prove finite time convergence rates for TD learning with linear function approximation. The analysis rests on a key insight that establishes a rigorous connection between TD updates and those of online gradient descent. In a model where observations are corrupted by i.i.d. noise, convergence results for TD follow by essentially mirroring the analysis for online gradient descent. Using an information-theoretic technique, the authors also provide results for the case when TD is applied to a single Markovian data stream, where the algorithm’s updates can be severely biased. Their analysis seamlessly extends to the study of TD learning with eligibility traces and to Q-learning for high-dimensional optimal stopping problems.
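To make the setting concrete, the sketch below shows the TD(0) update with linear function approximation that the analysis studies. The feature map `phi`, transition sampler `step`, and constant step size `alpha` are illustrative assumptions for this sketch, not details taken from the paper.

```python
# A minimal sketch of TD(0) with linear value-function approximation,
# V(s) ≈ phi(s)·theta. The feature map `phi` and transition sampler
# `step` are hypothetical placeholders, not part of the paper.
import numpy as np

def td0_linear(phi, step, s0, d, alpha=0.05, gamma=0.99, n_steps=10_000):
    """Run TD(0) from initial state s0 for n_steps transitions.

    phi(s)  -> length-d feature vector (np.ndarray)
    step(s) -> sampled (reward, next_state) pair
    """
    theta = np.zeros(d)
    s = s0
    for _ in range(n_steps):
        r, s_next = step(s)
        # TD error: bootstrapped one-step target minus current estimate.
        delta = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        # Semi-gradient update; it is this step's resemblance to an online
        # gradient step that the analysis makes rigorous.
        theta += alpha * delta * phi(s)
        s = s_next
    return theta
```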
Cite
Text
Bhandari et al. "A Finite Time Analysis of Temporal Difference Learning with Linear Function Approximation." Annual Conference on Computational Learning Theory, 2018. doi:10.1287/OPRE.2020.2024
Markdown
[Bhandari et al. "A Finite Time Analysis of Temporal Difference Learning with Linear Function Approximation." Annual Conference on Computational Learning Theory, 2018.](https://mlanthology.org/colt/2018/bhandari2018colt-finite/) doi:10.1287/OPRE.2020.2024
BibTeX
@inproceedings{bhandari2018colt-finite,
title = {{A Finite Time Analysis of Temporal Difference Learning with Linear Function Approximation}},
author = {Bhandari, Jalaj and Russo, Daniel and Singal, Raghav},
booktitle = {Annual Conference on Computational Learning Theory},
year = {2018},
pages = {1691--1692},
doi = {10.1287/OPRE.2020.2024},
url = {https://mlanthology.org/colt/2018/bhandari2018colt-finite/}
}