A Finite-Sample Analysis of Multi-Step Temporal Difference Estimates
Abstract
We consider the problem of estimating the value function of an infinite-horizon $\gamma$-discounted Markov reward process (MRP). We establish non-asymptotic guarantees for a general family of multi-step temporal difference (TD) estimates, including canonical $K$-step look-ahead TD for $K = 1, 2, \ldots$ and the TD$(\lambda)$ family for $\lambda \in [0,1)$ as special cases. Our bounds capture the dependence of these estimates on both the variance as defined by Bellman fluctuations, and the bias arising from possible model mis-specification. Our results reveal that the variance component shows limited sensitivity to the choice of look-ahead defining the estimator itself, while increasing the look-ahead can reduce the bias term. This highlights the benefit of using a larger look-ahead: it reduces bias but need not increase the variance.
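The abstract's central objects are the K-step look-ahead TD estimates. Purely to illustrate that notion, and not as the estimator or analysis from the paper, the sketch below implements tabular K-step TD on sampled MRP trajectories in Python; the function name k_step_td, the step size alpha, the number of sweeps, and the toy two-state trajectories are hypothetical choices made for this example. Setting K = 1 recovers the classical one-step TD update, while a larger K relies on more observed rewards and bootstraps further into the future, which is the look-ahead trade-off the abstract discusses.

import numpy as np

def k_step_td(trajectories, num_states, K, gamma, alpha=0.1, sweeps=100):
    """Tabular K-step look-ahead TD on sampled MRP trajectories.

    Each trajectory is a pair (states, rewards) with len(states) == len(rewards) + 1.
    V(s_t) is moved toward the K-step return
        G_t^(K) = r_t + gamma r_{t+1} + ... + gamma^{K-1} r_{t+K-1} + gamma^K V(s_{t+K}),
    truncating the look-ahead at the end of the trajectory.
    """
    V = np.zeros(num_states)
    for _ in range(sweeps):
        for states, rewards in trajectories:
            T = len(rewards)
            for t in range(T):
                h = min(K, T - t)  # effective look-ahead near the end of the trajectory
                G = sum(gamma**i * rewards[t + i] for i in range(h))
                G += gamma**h * V[states[t + h]]  # bootstrap from the current estimate
                V[states[t]] += alpha * (G - V[states[t]])
    return V

# Toy 2-state MRP that alternates 0 -> 1 -> 0 -> ..., with reward 1 from state 0 and 0 from state 1.
trajs = [([0, 1, 0, 1, 0], [1.0, 0.0, 1.0, 0.0]) for _ in range(20)]
print(k_step_td(trajs, num_states=2, K=3, gamma=0.9))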
Cite
Text
Duan and Wainwright. "A Finite-Sample Analysis of Multi-Step Temporal Difference Estimates." Proceedings of The 5th Annual Learning for Dynamics and Control Conference, 2023.
Markdown
[Duan and Wainwright. "A Finite-Sample Analysis of Multi-Step Temporal Difference Estimates." Proceedings of The 5th Annual Learning for Dynamics and Control Conference, 2023.](https://mlanthology.org/l4dc/2023/duan2023l4dc-finitesample/)
BibTeX
@inproceedings{duan2023l4dc-finitesample,
  title     = {{A Finite-Sample Analysis of Multi-Step Temporal Difference Estimates}},
  author    = {Duan, Yaqi and Wainwright, Martin J.},
  booktitle = {Proceedings of The 5th Annual Learning for Dynamics and Control Conference},
  year      = {2023},
  pages     = {612--624},
  volume    = {211},
  url       = {https://mlanthology.org/l4dc/2023/duan2023l4dc-finitesample/}
}