Policy Optimization with Stochastic Mirror Descent

Abstract

Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the VRMPO algorithm: a sample-efficient policy gradient method based on stochastic mirror descent. In VRMPO, a novel variance-reduced policy gradient estimator is presented to improve sample efficiency. We prove that the proposed VRMPO needs only O(ε⁻³) sample trajectories to achieve an ε-approximate first-order stationary point, which matches the best known sample complexity for policy optimization. Extensive empirical results demonstrate that VRMPO outperforms state-of-the-art policy gradient methods in various settings.
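
The abstract describes VRMPO as combining a variance-reduced policy gradient estimator with a stochastic mirror descent update; the exact estimator and mirror map are specified in the paper itself. Below is only a minimal NumPy sketch of those two generic ingredients, assuming a Euclidean mirror map (so the mirror step reduces to plain gradient ascent) and an SVRG/SARAH-style variance-reduced estimator built from a hypothetical per-trajectory gradient function grad_fn. It is illustrative, not the authors' implementation.

import numpy as np

def mirror_descent_step(theta, grad, lr):
    # Mirror ascent step under the Euclidean mirror map; a different Bregman
    # divergence would change this update rule.
    return theta + lr * grad

def variance_reduced_grad(grad_fn, theta, theta_ref, full_grad_ref, batch):
    # SVRG/SARAH-style control variate: mini-batch gradient at theta,
    # corrected by the difference to a snapshot theta_ref whose full
    # gradient full_grad_ref was precomputed.
    g_theta = np.mean([grad_fn(theta, tau) for tau in batch], axis=0)
    g_ref = np.mean([grad_fn(theta_ref, tau) for tau in batch], axis=0)
    return g_theta - g_ref + full_grad_ref

if __name__ == "__main__":
    # Toy quadratic surrogate objective, just to exercise the update rule.
    rng = np.random.default_rng(0)
    grad_fn = lambda th, tau: -(th - tau)   # gradient of -0.5 * ||th - tau||^2
    trajectories = [rng.normal(size=3) for _ in range(64)]

    theta = np.zeros(3)
    theta_ref = theta.copy()
    full_grad_ref = np.mean([grad_fn(theta_ref, tau) for tau in trajectories], axis=0)

    for _ in range(100):
        idx = rng.choice(len(trajectories), size=8, replace=False)
        batch = [trajectories[i] for i in idx]
        v = variance_reduced_grad(grad_fn, theta, theta_ref, full_grad_ref, batch)
        theta = mirror_descent_step(theta, v, lr=0.1)

    print(theta)  # converges toward the mean of the sampled "trajectories"

A real policy-gradient instantiation would also have to account for the distribution shift between the snapshot policy and the current policy (e.g. via importance weighting), which this sketch omits.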

Cite

Text

Yang et al. "Policy Optimization with Stochastic Mirror Descent." AAAI Conference on Artificial Intelligence, 2022. doi:10.1609/AAAI.V36I8.20863

Markdown

[Yang et al. "Policy Optimization with Stochastic Mirror Descent." AAAI Conference on Artificial Intelligence, 2022.](https://mlanthology.org/aaai/2022/yang2022aaai-policy/) doi:10.1609/AAAI.V36I8.20863

BibTeX

@inproceedings{yang2022aaai-policy,
  title     = {{Policy Optimization with Stochastic Mirror Descent}},
  author    = {Yang, Long and Zhang, Yu and Zheng, Gang and Zheng, Qian and Li, Pengfei and Huang, Jianhang and Pan, Gang},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2022},
  pages     = {8823--8831},
  doi       = {10.1609/AAAI.V36I8.20863},
  url       = {https://mlanthology.org/aaai/2022/yang2022aaai-policy/}
}