Policy Optimization with Second-Order Advantage Information

Abstract

Policy optimization on high-dimensional continuous control tasks is difficult due to the large variance of policy gradient estimators. We present the action subspace dependent gradient (ASDG) estimator, which incorporates the Rao-Blackwell theorem (RB) and control variates (CV) into a unified framework to reduce this variance. To invoke RB, our proposed algorithm (POSA) learns the underlying factorization structure of the action space based on second-order advantage information. POSA captures this quadratic information explicitly and efficiently by utilizing a wide & deep architecture. Empirical studies show that our approach yields performance improvements on high-dimensional synthetic settings and on OpenAI Gym's MuJoCo continuous control tasks.
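
The core mechanism the abstract describes, reducing the variance of a score-function policy gradient by subtracting an action-dependent quadratic surrogate of the advantage and adding its analytic expectation back, can be illustrated with a small sketch. The code below is not the authors' implementation: the objective `f`, the one-dimensional Gaussian policy, and the finite-difference Taylor coefficients are illustrative assumptions standing in for the learned second-order advantage model.

```python
# Minimal sketch of a quadratic control variate for a score-function gradient estimator.
# Not the ASDG/POSA implementation; a toy stand-in for the second-order advantage idea.
import numpy as np

rng = np.random.default_rng(0)

def f(a):
    """Stand-in 'advantage' signal; any smooth function of the action works here."""
    return a**2 + 0.5 * np.sin(3.0 * a)

mu, sigma = 0.7, 1.0            # Gaussian policy pi(a) = N(mu, sigma^2)
n_samples = 200_000

a = rng.normal(mu, sigma, size=n_samples)
score = (a - mu) / sigma**2     # d/d_mu log pi(a)

# Quadratic surrogate phi(a) = c1*a + 0.5*c2*a^2 from a Taylor expansion of f at mu
# (in the paper this role is played by a learned second-order advantage model).
eps = 1e-4
f1 = (f(mu + eps) - f(mu - eps)) / (2 * eps)            # f'(mu)
f2 = (f(mu + eps) - 2 * f(mu) + f(mu - eps)) / eps**2   # f''(mu)
c2 = f2
c1 = f1 - f2 * mu
phi = c1 * a + 0.5 * c2 * a**2                          # constant term contributes zero

# Plain score-function estimator vs. control-variate estimator.
plain = score * f(a)
analytic_grad_phi = c1 + c2 * mu                        # d/d_mu E[phi(a)] in closed form for a Gaussian
cv = score * (f(a) - phi) + analytic_grad_phi           # unbiased, lower variance when phi tracks f

print("mean (plain):", plain.mean(), " mean (CV):", cv.mean())   # both estimate the true gradient
print("var  (plain):", plain.var(),  " var  (CV):", cv.var())    # CV variance is much smaller
```

Because `phi` does not depend on `mu` (its coefficients are held fixed), the identity E[∇ log π(a) · φ(a)] = ∇ E[φ(a)] keeps the corrected estimator unbiased while the residual f(a) − φ(a) is small wherever the quadratic fit is good, which is the variance-reduction effect the ASDG estimator exploits.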

Cite

Text

Li et al. "Policy Optimization with Second-Order Advantage Information." International Joint Conference on Artificial Intelligence, 2018. doi:10.24963/IJCAI.2018/699

Markdown

[Li et al. "Policy Optimization with Second-Order Advantage Information." International Joint Conference on Artificial Intelligence, 2018.](https://mlanthology.org/ijcai/2018/li2018ijcai-policy/) doi:10.24963/IJCAI.2018/699

BibTeX

@inproceedings{li2018ijcai-policy,
  title     = {{Policy Optimization with Second-Order Advantage Information}},
  author    = {Li, Jiajin and Wang, Baoxiang and Zhang, Shengyu},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2018},
  pages     = {5038--5044},
  doi       = {10.24963/IJCAI.2018/699},
  url       = {https://mlanthology.org/ijcai/2018/li2018ijcai-policy/}
}