Towards Safe Reinforcement Learning via Constraining Conditional Value at Risk

Abstract

Though deep reinforcement learning (DRL) has achieved substantial success, it may encounter catastrophic failures due to the intrinsic uncertainty introduced by stochastic policies and environment variability. To address this issue, we propose CVaR-Proximal-Policy-Optimization (CPPO), a novel reinforcement learning framework that adopts the conditional value-at-risk (CVaR) as a risk measure. We show theoretically that the performance degradation under observation disturbances and transition probability disturbances depends on the range of the disturbance as well as the gap in the value function between different states. Therefore, constraining the value function across states via CVaR can improve the robustness of the policy. Experimental results show that CPPO achieves higher cumulative reward and exhibits stronger robustness against observation disturbances and transition probability disturbances in the environment dynamics across a series of continuous control tasks in MuJoCo.
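The abstract does not spell out how CVaR is estimated in practice; as a minimal illustrative sketch (not code from the paper), the snippet below computes the empirical CVaR at level alpha of a batch of value-function gaps, i.e. the mean of the worst (1 - alpha) fraction of samples. The function name, the alpha value, and the synthetic gaps are assumptions made purely for illustration.

```python
import numpy as np

def empirical_cvar(losses, alpha=0.9):
    """Empirical CVaR_alpha: mean of the worst (1 - alpha) fraction of losses."""
    losses = np.asarray(losses, dtype=float)
    var = np.quantile(losses, alpha)      # value-at-risk (tail threshold)
    tail = losses[losses >= var]          # samples in the upper tail
    return tail.mean()

# Illustrative usage: treat gaps of the value function between perturbed
# and nominal states as "losses" and measure their CVaR.
rng = np.random.default_rng(0)
value_gaps = np.abs(rng.normal(size=1000))   # placeholder gaps, not real data
print(empirical_cvar(value_gaps, alpha=0.9))
```

Constraining this tail average, rather than the worst single gap, is what distinguishes a CVaR-style constraint from a worst-case one.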

Cite

Text

Ying et al. "Towards Safe Reinforcement Learning via Constraining Conditional Value at Risk." ICML 2021 Workshops: AML, 2021.

Markdown

[Ying et al. "Towards Safe Reinforcement Learning via Constraining Conditional Value at Risk." ICML 2021 Workshops: AML, 2021.](https://mlanthology.org/icmlw/2021/ying2021icmlw-safe/)

BibTeX

@inproceedings{ying2021icmlw-safe,
  title     = {{Towards Safe Reinforcement Learning via Constraining Conditional Value at Risk}},
  author    = {Ying, Chengyang and Zhou, Xinning and Yan, Dong and Zhu, Jun},
  booktitle = {ICML 2021 Workshops: AML},
  year      = {2021},
  url       = {https://mlanthology.org/icmlw/2021/ying2021icmlw-safe/}
}