Understanding Policy Gradient Algorithms: A Sensitivity-Based Approach

Abstract

The REINFORCE algorithm (Williams, 1992) is a popular policy gradient (PG) method for solving reinforcement learning (RL) problems, whereas the theoretical form of PG comes from the policy gradient theorem (Sutton et al., 1999). Although both formulae prescribe a policy gradient, their precise connection has not been clearly spelled out. Recently, Nota and Thomas (2020) found that this ambiguity leads to implementation errors. Motivated by the ambiguity and the resulting incorrect implementations, we study PG from a perturbation perspective. In particular, we derive PG in a unified framework, precisely clarify the relation between PG implementations and theory, and echo the findings of Nota and Thomas. Examining the factors behind the empirical success of the existing erroneous implementations, we find that a small approximation error and the experience replay mechanism play critical roles.
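The implementation discrepancy highlighted by Nota and Thomas (2020) is commonly described as dropping the discount weighting γ^t on each step's contribution, so the resulting update is no longer the gradient of the discounted objective. The sketch below is not from the paper; it is a minimal illustration of the two estimators for a single trajectory, with illustrative function and variable names, assuming per-step gradients of log π_θ(a_t | s_t) are available as arrays.

```python
import numpy as np

def returns_to_go(rewards, gamma):
    """Discounted return G_t = sum_{k >= t} gamma^(k-t) * r_k for each step t."""
    G, out = 0.0, np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        out[t] = G
    return out

def reinforce_gradient(grad_log_probs, rewards, gamma, drop_gamma_t=True):
    """Monte-Carlo policy gradient estimate from one trajectory (illustrative sketch).

    grad_log_probs[t] is assumed to hold the array grad_theta log pi_theta(a_t | s_t).
    drop_gamma_t=True mimics the common implementation that omits the gamma^t
    weighting; drop_gamma_t=False follows the estimator prescribed by the policy
    gradient theorem for the discounted objective.
    """
    G = returns_to_go(np.asarray(rewards, dtype=float), gamma)
    grad = np.zeros_like(grad_log_probs[0])
    for t, g_log in enumerate(grad_log_probs):
        weight = G[t] if drop_gamma_t else (gamma ** t) * G[t]
        grad += weight * g_log
    return grad
```

With drop_gamma_t=True the late-trajectory terms are weighted as heavily as early ones, which is the behavior most deep RL codebases exhibit; the paper's analysis concerns why such updates still work well in practice.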

Cite

Text

Wu et al. "Understanding Policy Gradient Algorithms: A Sensitivity-Based Approach." International Conference on Machine Learning, 2022.

Markdown

[Wu et al. "Understanding Policy Gradient Algorithms: A Sensitivity-Based Approach." International Conference on Machine Learning, 2022.](https://mlanthology.org/icml/2022/wu2022icml-understanding/)

BibTeX

@inproceedings{wu2022icml-understanding,
  title     = {{Understanding Policy Gradient Algorithms: A Sensitivity-Based Approach}},
  author    = {Wu, Shuang and Shi, Ling and Wang, Jun and Tian, Guangjian},
  booktitle = {International Conference on Machine Learning},
  year      = {2022},
  pages     = {24131--24149},
  volume    = {162},
  url       = {https://mlanthology.org/icml/2022/wu2022icml-understanding/}
}