Policy Optimization with Second-Order Advantage Information
Abstract
Policy optimization on high-dimensional continuous control tasks is difficult due to the large variance of policy gradient estimators. We present the action subspace dependent gradient (ASDG) estimator, which incorporates the Rao-Blackwell theorem (RB) and control variates (CV) into a unified framework to reduce this variance. To invoke RB, our proposed algorithm (POSA) learns the underlying factorization structure of the action space from second-order advantage information. POSA captures this quadratic information explicitly and efficiently by utilizing a wide & deep architecture. Empirical studies show that our approach yields performance improvements on high-dimensional synthetic settings and on OpenAI Gym's MuJoCo continuous control tasks.
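To make the variance-reduction idea in the abstract concrete, the sketch below combines a score-function (REINFORCE-style) policy gradient with an action-dependent control variate applied per block of action dimensions, the kind of structure a block-factorized advantage would justify. This is a minimal sketch under stated assumptions, not the paper's ASDG/POSA implementation; `advantage_fn`, `baseline_fn`, and `partition` are hypothetical placeholders.

```python
# Hedged sketch: per-block policy gradient with an action-dependent control
# variate for a factorized Gaussian policy. Illustrative only; this is NOT
# the paper's ASDG/POSA algorithm.
import numpy as np

def gaussian_score(actions, mean, std):
    """Gradient of log N(a | mean, std^2) with respect to the mean, per dimension."""
    return (actions - mean) / (std ** 2)

def blockwise_gradient(states, actions, mean, std,
                       advantage_fn, baseline_fn, partition):
    """
    states    : (N, d_s) sampled states
    actions   : (N, d_a) sampled actions
    mean, std : (N, d_a) Gaussian policy parameters at those states
    advantage_fn(s, a)       -> (N,) advantage estimates
    baseline_fn(s, a, block) -> (N,) control variate that depends only on the
                                action dimensions OUTSIDE `block`
    partition : list of index lists; each block groups action dimensions assumed
                to interact in a block-diagonal advantage Hessian
    Returns per-sample estimates of dJ/d(mean), shape (N, d_a); averaging over
    the first axis gives the gradient estimate.
    """
    score = gaussian_score(actions, mean, std)   # (N, d_a)
    adv = advantage_fn(states, actions)          # (N,)
    grad = np.zeros_like(mean)
    for block in partition:
        # Because the baseline ignores this block's own action dimensions, its
        # product with the block's score has zero expectation, so subtracting
        # it reduces variance without biasing the estimator.
        b = baseline_fn(states, actions, block)  # (N,)
        grad[:, block] = score[:, block] * (adv - b)[:, None]
    return grad
```

The per-block baseline plays the role of the control variate, while restricting each block's gradient to its own action subspace reflects the Rao-Blackwell-style conditioning that the paper derives from second-order advantage information.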
Cite
Text
Li et al. "Policy Optimization with Second-Order Advantage Information." International Joint Conference on Artificial Intelligence, 2018. doi:10.24963/IJCAI.2018/699

Markdown
[Li et al. "Policy Optimization with Second-Order Advantage Information." International Joint Conference on Artificial Intelligence, 2018.](https://mlanthology.org/ijcai/2018/li2018ijcai-policy/) doi:10.24963/IJCAI.2018/699

BibTeX
@inproceedings{li2018ijcai-policy,
title = {{Policy Optimization with Second-Order Advantage Information}},
author = {Li, Jiajin and Wang, Baoxiang and Zhang, Shengyu},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2018},
pages = {5038-5044},
doi = {10.24963/IJCAI.2018/699},
url = {https://mlanthology.org/ijcai/2018/li2018ijcai-policy/}
}