Neural Temporal-Difference Learning Converges to Global Optima
Abstract
Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, due to the nonlinearity in value function approximation, such a coupling leads to nonconvexity and even divergence in optimization. As a result, the global convergence of neural TD remains unclear. In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. In particular, we show how such global convergence is enabled by the overparametrization of neural networks, which also plays a vital role in the empirical success of neural TD. Beyond policy evaluation, we establish the global convergence of neural (soft) Q-learning, which is further connected to that of policy gradient algorithms.
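For readers who want a concrete picture of the algorithm being analyzed, below is a minimal sketch of neural TD(0) for policy evaluation: a semi-gradient update that moves the network's value estimate toward the bootstrap target r + gamma * V(s'). The PyTorch two-layer network, the toy random transitions, the width, and the step size are illustrative assumptions for this sketch, not the paper's exact construction or analysis setting.

# Minimal sketch of neural TD(0) for policy evaluation (illustrative only).
import torch
import torch.nn as nn

torch.manual_seed(0)

state_dim, width, gamma, lr = 4, 256, 0.9, 1e-3

# Overparameterized two-layer network approximating the value function V(s).
value_net = nn.Sequential(
    nn.Linear(state_dim, width),
    nn.ReLU(),
    nn.Linear(width, 1),
)
optimizer = torch.optim.SGD(value_net.parameters(), lr=lr)

def td_update(s, r, s_next):
    """One semi-gradient TD(0) step: fit V(s) to the bootstrap target
    r + gamma * V(s'), treating the target as a constant (no gradient)."""
    v = value_net(s)
    with torch.no_grad():
        target = r + gamma * value_net(s_next)
    loss = 0.5 * (v - target).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: random transitions stand in for samples from a fixed policy.
for _ in range(10):
    s = torch.randn(32, state_dim)
    s_next = s + 0.1 * torch.randn(32, state_dim)
    r = torch.randn(32, 1)
    td_update(s, r, s_next)

The "overparametrization" discussed in the abstract corresponds to taking the hidden width large, which is what the paper's analysis exploits to obtain global convergence for the projected Bellman error objective.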
Cite
Text
Cai et al. "Neural Temporal-Difference Learning Converges to Global Optima." Neural Information Processing Systems, 2019.
Markdown
[Cai et al. "Neural Temporal-Difference Learning Converges to Global Optima." Neural Information Processing Systems, 2019.](https://mlanthology.org/neurips/2019/cai2019neurips-neural/)
BibTeX
@inproceedings{cai2019neurips-neural,
  title     = {{Neural Temporal-Difference Learning Converges to Global Optima}},
  author    = {Cai, Qi and Yang, Zhuoran and Lee, Jason and Wang, Zhaoran},
  booktitle = {Neural Information Processing Systems},
  year      = {2019},
  pages     = {11315--11326},
  url       = {https://mlanthology.org/neurips/2019/cai2019neurips-neural/}
}