P2BPO: Permeable Penalty Barrier-Based Policy Optimization for Safe RL

Abstract

Safe Reinforcement Learning (SRL) algorithms aim to learn a policy that maximizes reward while satisfying safety constraints. A central challenge in SRL is balancing the two objectives of reward maximization and constraint satisfaction. Existing algorithms rely on constrained optimization techniques such as penalty-based, barrier penalty-based, and Lagrangian-based dual or primal policy optimization methods. However, these methods suffer from training oscillations and approximation errors, which degrade the overall learning objectives. This paper proposes the Permeable Penalty Barrier-based Policy Optimization (P2BPO) algorithm, which addresses these issues by allowing a small fraction of the penalty beyond the penalty barrier, with a parameter that controls this permeability. In addition, an adaptive penalty parameter is used instead of a constant one: it is initialized with a low value and increased gradually as the agent violates the safety constraints. We also provide a theoretical proof of a performance guarantee bound for the proposed method, which ensures that P2BPO can learn a policy that satisfies the safety constraints with high probability while achieving a higher expected reward. Furthermore, we compare P2BPO with other SRL algorithms on various SRL tasks and demonstrate that it achieves better rewards while adhering to the constraints.
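The abstract does not give the penalty function or the update rule, so the following is only a rough sketch of the two ideas it names: a penalty that "leaks" across a barrier, and a penalty coefficient that starts low and grows on violations. Everything here is an assumption made for illustration, not the paper's formulation: the leaky piecewise-linear penalty, the additive coefficient update, and all names (`permeable_penalty`, `update_coeff`, `permeability`, `step`, `max_coeff`).

```python
def permeable_penalty(cost, limit, coeff, permeability):
    """Hypothetical leaky penalty around a cost limit (assumed form).

    Above the limit the full penalty slope applies; below it only a
    small fraction (`permeability`) of the slope passes through.
    """
    excess = cost - limit
    slope = 1.0 if excess > 0 else permeability
    return coeff * slope * excess


def update_coeff(coeff, cost, limit, step=0.1, max_coeff=10.0):
    """Assumed adaptive rule: raise the coefficient only on a violation."""
    if cost > limit:
        return min(coeff + step, max_coeff)
    return coeff


def penalized_objective(reward, cost, limit, coeff, permeability):
    """Reward minus the hypothetical permeable penalty."""
    return reward - permeable_penalty(cost, limit, coeff, permeability)


if __name__ == "__main__":
    coeff = 0.5  # start with a low penalty coefficient
    # Illustrative (made-up) per-epoch reward and cost values.
    for reward, cost in [(10.0, 30.0), (11.0, 28.0), (12.0, 24.0)]:
        obj = penalized_objective(reward, cost, limit=25.0,
                                  coeff=coeff, permeability=0.1)
        coeff = update_coeff(coeff, cost, limit=25.0)
        print(f"objective={obj:.3f} next_coeff={coeff:.2f}")
```

In this sketch the coefficient stops growing once the measured cost falls back under the limit, which is one simple way to read "increased gradually as the agent violates the safety constraints"; the paper itself should be consulted for the actual rule.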

Cite

Text

Dey et al. "P2BPO: Permeable Penalty Barrier-Based Policy Optimization for Safe RL." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I19.30094

Markdown

[Dey et al. "P2BPO: Permeable Penalty Barrier-Based Policy Optimization for Safe RL." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/dey2024aaai-p/) doi:10.1609/AAAI.V38I19.30094

BibTeX

@inproceedings{dey2024aaai-p,
  title     = {{P2BPO: Permeable Penalty Barrier-Based Policy Optimization for Safe RL}},
  author    = {Dey, Sumanta and Dasgupta, Pallab and Dey, Soumyajit},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {21029-21036},
  doi       = {10.1609/AAAI.V38I19.30094},
  url       = {https://mlanthology.org/aaai/2024/dey2024aaai-p/}
}