A Theoretical Understanding of Gradient Bias in Meta-Reinforcement Learning

Liu, Bo; Feng, Xidong; Ren, Jie; Mai, Luo; Zhu, Rui; Zhang, Haifeng; Wang, Jun; Yang, Yaodong

NeurIPS 2022

Abstract

Gradient-based Meta-RL (GMRL) refers to methods that maintain two-level optimisation procedures wherein the outer-loop meta-learner guides the inner-loop gradient-based reinforcement learner to achieve fast adaptations. In this paper, we develop a unified framework that describes variations of GMRL algorithms and point out that existing stochastic meta-gradient estimators adopted by GMRL are biased. This meta-gradient bias comes from two sources: 1) the compositional bias incurred by the two-level problem structure, which has an upper bound of $\mathcal{O}\big(K\alpha^{K}\hat{\sigma}_{\text{In}}|\tau|^{-0.5}\big)$ w.r.t. the number of inner-loop update steps $K$, learning rate $\alpha$, estimate variance $\hat{\sigma}^{2}_{\text{In}}$ and sample size $|\tau|$, and 2) the multi-step Hessian estimation bias $\hat{\Delta}_{H}$ due to the use of autodiff, which has a polynomial impact $\mathcal{O}\big((K-1)(\hat{\Delta}_{H})^{K-1}\big)$ on the meta-gradient bias. We study tabular MDPs empirically and offer quantitative evidence that supports our theoretical findings on existing stochastic meta-gradient estimators. Furthermore, we conduct experiments on Iterated Prisoner's Dilemma and Atari games to show how other methods, such as off-policy learning and low-bias estimators, can help fix the gradient bias for GMRL algorithms in general.
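The compositional bias described in the abstract can be illustrated with a toy scalar problem. The sketch below is not the paper's estimator or experimental setup; it is a minimal illustration under stated assumptions. It assumes a single inner step $\theta' = \theta + \alpha\hat{g}$, where $\hat{g}$ is the mean of `n_samples` Gaussian draws around a true inner gradient, and an outer objective $J(\theta') = \exp(\theta')$, chosen only because its plug-in bias has a closed form. All function and parameter names are hypothetical.

```python
import math
import random


def plugin_meta_gradient_bias(n_samples, alpha=0.5, sigma=2.0, theta=0.0,
                              true_inner_grad=1.0, n_trials=100_000, seed=0):
    """Monte Carlo estimate of the bias of a plug-in outer gradient.

    Toy assumption: the outer gradient is dJ/dtheta' = exp(theta'), evaluated
    at theta' = theta + alpha * g_hat, where g_hat is a noisy inner gradient.
    """
    rng = random.Random(seed)
    # The mean of n_samples i.i.d. N(g, sigma^2) draws is N(g, sigma^2 / n_samples).
    noise_std = sigma / math.sqrt(n_samples)
    total = 0.0
    for _ in range(n_trials):
        g_hat = true_inner_grad + noise_std * rng.gauss(0.0, 1.0)
        total += math.exp(theta + alpha * g_hat)  # plug-in outer gradient
    exact = math.exp(theta + alpha * true_inner_grad)  # uses the true inner gradient
    return total / n_trials - exact


def analytic_bias(n_samples, alpha=0.5, sigma=2.0, theta=0.0, true_inner_grad=1.0):
    """Closed-form bias for the same toy, from the Gaussian moment generating function."""
    exact = math.exp(theta + alpha * true_inner_grad)
    return exact * (math.exp(alpha ** 2 * sigma ** 2 / (2 * n_samples)) - 1.0)


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(n, plugin_meta_gradient_bias(n), analytic_bias(n))
```

Even though $\hat{g}$ is itself unbiased, passing it through the nonlinear outer gradient yields a biased result, and in this toy the bias shrinks as the inner-loop sample size grows. The decay rate here is specific to this smooth example and should not be read as reproducing the paper's bound.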

Cite

Text

Liu et al. "A Theoretical Understanding of Gradient Bias in Meta-Reinforcement Learning." Neural Information Processing Systems, 2022.

Markdown

[Liu et al. "A Theoretical Understanding of Gradient Bias in Meta-Reinforcement Learning." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/liu2022neurips-theoretical/)

BibTeX

@inproceedings{liu2022neurips-theoretical,
  title     = {{A Theoretical Understanding of Gradient Bias in Meta-Reinforcement Learning}},
  author    = {Liu, Bo and Feng, Xidong and Ren, Jie and Mai, Luo and Zhu, Rui and Zhang, Haifeng and Wang, Jun and Yang, Yaodong},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/liu2022neurips-theoretical/}
}