On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method

Abstract

Policy gradient (PG) gives rise to a rich class of reinforcement learning (RL) methods. Recently, there has been an emerging trend to augment existing PG methods such as REINFORCE with \emph{variance reduction} techniques. However, all existing variance-reduced PG methods heavily rely on an uncheckable importance weight assumption made for every iteration of the algorithm. In this paper, we propose a simple gradient truncation mechanism to address this issue. Moreover, we design a Truncated Stochastic Incremental Variance-Reduced Policy Gradient (TSIVR-PG) method, which is able to maximize not only a cumulative sum of rewards but also a general utility function over a policy's long-term visiting distribution. We show an $\tilde{\mathcal{O}}(\epsilon^{-3})$ sample complexity for TSIVR-PG to find an $\epsilon$-stationary policy. By assuming the \emph{overparameterization} of the policy and exploiting the \emph{hidden convexity} of the problem, we further show that TSIVR-PG converges to a globally $\epsilon$-optimal policy with $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples.
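
To make the mechanism concrete, here is a minimal sketch drawn from the standard variance-reduced PG literature, not the paper's exact construction or notation. Assume $g(\tau;\theta)$ is a single-trajectory REINFORCE-style gradient estimate, $w(\tau;\theta',\theta) = \prod_{h}\pi_{\theta'}(a_h\mid s_h)/\pi_{\theta}(a_h\mid s_h)$ is the trajectory importance weight, and $\eta$, $\delta$ are a hypothetical step size and truncation radius. A recursive (SARAH/SPIDER-style) variance-reduced estimate and a truncated update could take the form

$$ v_t \;=\; g(\tau_t;\theta_t) \;+\; v_{t-1} \;-\; w(\tau_t;\theta_{t-1},\theta_t)\, g(\tau_t;\theta_{t-1}), \qquad \tau_t \sim \pi_{\theta_t}, $$

$$ \theta_{t+1} \;=\; \theta_t \;+\; \eta\,\min\!\Big\{1,\ \tfrac{\delta}{\|v_t\|}\Big\}\, v_t. $$

Because the truncation bounds each step by $\|\theta_{t+1}-\theta_t\| \le \eta\delta$, the consecutive policies stay close and the importance weight $w$ is kept controlled by the algorithm itself, which is the role that the uncheckable importance weight assumption plays in earlier variance-reduced PG methods.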

Cite

Text

Zhang et al. "On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method." Neural Information Processing Systems, 2021.

Markdown

[Zhang et al. "On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/zhang2021neurips-convergence/)

BibTeX

@inproceedings{zhang2021neurips-convergence,
  title     = {{On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method}},
  author    = {Zhang, Junyu and Ni, Chengzhuo and Yu, Zheng and Szepesvari, Csaba and Wang, Mengdi},
  booktitle = {Neural Information Processing Systems},
  year      = {2021},
  url       = {https://mlanthology.org/neurips/2021/zhang2021neurips-convergence/}
}