Provably Efficient Neural GTD for Off-Policy Learning

Abstract

This paper studies a gradient temporal difference (GTD) algorithm that uses neural network (NN) function approximators to minimize the mean squared Bellman error (MSBE). For off-policy learning, we show that the MSBE minimization problem can be recast as a min-max optimization involving a pair of over-parameterized primal-dual NNs. The resultant formulation can then be tackled using a neural GTD algorithm. We analyze the convergence of the proposed algorithm for a 2-layer ReLU NN architecture with $m$ neurons and prove that it computes an approximately optimal solution to the MSBE minimization problem as $m \rightarrow \infty$.
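For intuition, below is a minimal PyTorch sketch of the kind of primal-dual GTD update the abstract describes, using the standard Fenchel-dual (saddle-point) form of the MSBE: min over the primal value network, max over a dual network that tracks the Bellman residual. The class and function names, the plain descent-ascent updates, the step sizes, and the importance weight `rho` are illustrative assumptions, not the paper's exact algorithm.

import torch
import torch.nn as nn

def two_layer_relu(in_dim, m=256):
    # Over-parameterized 2-layer ReLU network with m neurons.
    return nn.Sequential(nn.Linear(in_dim, m), nn.ReLU(), nn.Linear(m, 1))

class NeuralGTD:
    # A sketch (hypothetical names) of descent-ascent on the saddle-point MSBE:
    #   min_v max_f  E[ 2 f(s) * delta(v) - f(s)^2 ],
    #   delta(v) = rho * (r + gamma * v(s') - v(s))  (importance-weighted TD error)
    def __init__(self, state_dim, m=256, gamma=0.99, lr=1e-3):
        self.gamma = gamma
        self.v = two_layer_relu(state_dim, m)  # primal NN: value estimate
        self.f = two_layer_relu(state_dim, m)  # dual NN: tracks the Bellman residual
        self.opt_v = torch.optim.SGD(self.v.parameters(), lr=lr)
        self.opt_f = torch.optim.SGD(self.f.parameters(), lr=lr)

    def step(self, s, r, s_next, rho):
        # rho: importance weights correcting for the off-policy behavior distribution.
        delta = rho * (r + self.gamma * self.v(s_next) - self.v(s))
        loss = (2.0 * self.f(s) * delta - self.f(s) ** 2).mean()
        self.opt_v.zero_grad()
        self.opt_f.zero_grad()
        loss.backward()  # full gradient, including through the bootstrap target
        for p in self.f.parameters():  # ascent on the dual: flip its gradients
            p.grad.neg_()
        self.opt_v.step()
        self.opt_f.step()
        return float(delta.detach().abs().mean())

Note that gradients flow through the bootstrap target gamma * v(s'): at the inner maximum the dual satisfies f = delta, so the objective equals the MSBE and the primal update follows its true gradient. This is what distinguishes GTD-style updates from semi-gradient TD, which detaches the target.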

Cite

Text

Wai et al. "Provably Efficient Neural GTD for Off-Policy Learning." Neural Information Processing Systems, 2020.

Markdown

[Wai et al. "Provably Efficient Neural GTD for Off-Policy Learning." Neural Information Processing Systems, 2020.](https://mlanthology.org/neurips/2020/wai2020neurips-provably/)

BibTeX

@inproceedings{wai2020neurips-provably,
  title     = {{Provably Efficient Neural GTD for Off-Policy Learning}},
  author    = {Wai, Hoi-To and Yang, Zhuoran and Wang, Zhaoran and Hong, Mingyi},
  booktitle = {Neural Information Processing Systems},
  year      = {2020},
  url       = {https://mlanthology.org/neurips/2020/wai2020neurips-provably/}
}