Temporal Difference Learning as Gradient Splitting
Abstract
Temporal difference learning with linear function approximation is a popular method to obtain a low-dimensional approximation of the value function of a policy in a Markov Decision Process. We provide an interpretation of this method in terms of a splitting of the gradient of an appropriately chosen function. As a consequence of this interpretation, convergence proofs for gradient descent can be applied almost verbatim to temporal difference learning. Beyond giving a fuller explanation of why temporal difference learning works, this interpretation also yields improved convergence times. We consider the setting with $1/\sqrt{T}$ step-size, where previous comparable finite-time convergence bounds for temporal difference learning had the multiplicative factor $1/(1-\gamma)$ in front of the bound, with $\gamma$ being the discount factor. We show that a minor variation of TD learning, which estimates the mean of the value function separately, has a convergence time where $1/(1-\gamma)$ only multiplies an asymptotically negligible term.
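For readers unfamiliar with the setup the abstract refers to, the following is a minimal sketch of plain TD(0) with linear function approximation, using a constant $1/\sqrt{T}$ step-size and iterate averaging. It runs on a synthetic Markov chain whose transition matrix, features, and rewards are made up for illustration; it does not reproduce the paper's mean-adjusted variant or its analysis.

```python
import numpy as np

# Sketch: TD(0) with linear function approximation on a synthetic Markov
# chain (policy evaluation). The constant step-size alpha = 1/sqrt(T) with
# iterate averaging matches the setting mentioned in the abstract; the
# mean-adjusted TD variant from the paper is not implemented here.

rng = np.random.default_rng(0)

n_states, d = 20, 5           # number of states, feature dimension (illustrative)
gamma = 0.9                   # discount factor
T = 10_000                    # number of TD updates
alpha = 1.0 / np.sqrt(T)      # 1/sqrt(T) step-size from the abstract

# Random transition matrix of the Markov chain induced by the fixed policy.
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)

Phi = rng.standard_normal((n_states, d))   # feature matrix (one row per state)
rewards = rng.random(n_states)             # expected reward per state

theta = np.zeros(d)        # linear weights: V(s) is approximated by Phi[s] @ theta
theta_avg = np.zeros(d)    # running average of the iterates

s = 0
for t in range(1, T + 1):
    s_next = rng.choice(n_states, p=P[s])
    # TD(0) error and semi-gradient update.
    td_error = rewards[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    theta += alpha * td_error * Phi[s]
    theta_avg += (theta - theta_avg) / t   # Polyak-style averaging
    s = s_next

print("averaged TD weights:", theta_avg)
```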
Cite
Text
Liu and Olshevsky. "Temporal Difference Learning as Gradient Splitting." International Conference on Machine Learning, 2021.
Markdown
[Liu and Olshevsky. "Temporal Difference Learning as Gradient Splitting." International Conference on Machine Learning, 2021.](https://mlanthology.org/icml/2021/liu2021icml-temporal/)
BibTeX
@inproceedings{liu2021icml-temporal,
title = {{Temporal Difference Learning as Gradient Splitting}},
author = {Liu, Rui and Olshevsky, Alex},
booktitle = {International Conference on Machine Learning},
year = {2021},
pages = {6905--6913},
volume = {139},
url = {https://mlanthology.org/icml/2021/liu2021icml-temporal/}
}