A Temporal-Difference Approach to Policy Gradient Estimation
Abstract
The policy gradient theorem (Sutton et al., 2000) prescribes the use of a cumulative discounted state distribution under the target policy to approximate the gradient. In practice, most algorithms based on this theorem break this assumption, introducing a distribution shift that can cause convergence to poor solutions. In this paper, we propose a new approach to reconstructing the policy gradient from the start state without requiring a particular sampling strategy. The policy gradient in this form can be expressed in terms of a gradient critic, which can be estimated recursively through a new Bellman equation for gradients. By applying temporal-difference updates to the gradient critic from an off-policy data stream, we develop the first estimator that side-steps the distribution shift issue in a model-free way. We prove that, under certain realizability conditions, our estimator is unbiased regardless of the sampling strategy. We empirically show that our technique achieves a superior bias-variance trade-off and better performance in the presence of off-policy samples.
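For intuition, here is a rough sketch of the recursion the abstract refers to; the notation is ours and may differ from the paper's. Differentiating the Bellman equation for Q^\pi with respect to the policy parameters \theta yields a Bellman-style equation that a gradient critic can estimate by temporal-difference learning:

\[
\nabla_\theta Q^\pi(s,a) \;=\; \gamma \, \mathbb{E}_{s' \sim P(\cdot\mid s,a),\; a' \sim \pi(\cdot\mid s')}\!\big[\, \nabla_\theta \log \pi(a'\mid s')\, Q^\pi(s',a') \;+\; \nabla_\theta Q^\pi(s',a') \,\big],
\]

so the policy gradient can be recovered at the start state as

\[
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{a_0 \sim \pi(\cdot\mid s_0)}\!\big[\, \nabla_\theta \log \pi(a_0\mid s_0)\, Q^\pi(s_0,a_0) \;+\; \nabla_\theta Q^\pi(s_0,a_0) \,\big].
\]

Under this view, a learned approximation of \nabla_\theta Q^\pi can be updated from off-policy transitions with standard temporal-difference targets, avoiding the need to sample from the discounted on-policy state distribution.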
Cite
Text
Tosatto et al. "A Temporal-Difference Approach to Policy Gradient Estimation." International Conference on Machine Learning, 2022.
Markdown
[Tosatto et al. "A Temporal-Difference Approach to Policy Gradient Estimation." International Conference on Machine Learning, 2022.](https://mlanthology.org/icml/2022/tosatto2022icml-temporaldifference/)
BibTeX
@inproceedings{tosatto2022icml-temporaldifference,
title = {{A Temporal-Difference Approach to Policy Gradient Estimation}},
author = {Tosatto, Samuele and Patterson, Andrew and White, Martha and Mahmood, Rupam},
booktitle = {International Conference on Machine Learning},
year = {2022},
pages = {21609--21632},
volume = {162},
url = {https://mlanthology.org/icml/2022/tosatto2022icml-temporaldifference/}
}