Breaking Safety Paradox with Feasible Dual Policy Iteration

Abstract

Achieving zero constraint violations in safe reinforcement learning poses a significant challenge. We identify a key obstacle, which we call the safety paradox: improving policy safety reduces the frequency of constraint-violating samples, which impairs feasibility function estimation and ultimately undermines policy safety. We theoretically prove that the estimation error bound of the feasibility function increases as the proportion of violating samples decreases. To overcome the safety paradox, we propose an algorithm called feasible dual policy iteration (FDPI), which employs an additional policy to strategically maximize constraint violations while staying close to the original policy. Samples from both policies are combined for training, with the data distribution corrected by importance sampling. Extensive experiments on the Safety-Gymnasium benchmark show that FDPI achieves state-of-the-art performance, attaining the lowest violation and competitive-to-best return simultaneously.
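To make the idea of combining samples from two policies concrete, here is a minimal sketch of importance-sampling correction in a toy, single-state setting. It is not the paper's implementation. The function name `mixture_is_weight`, the mixing coefficient `alpha`, the three-action policies, and the choice of a mixture-distribution weight are all assumptions made for illustration; the abstract does not specify the exact form of the correction used in FDPI.

```python
import random


def mixture_is_weight(p_target, p_dual, alpha=0.5):
    """Importance weight for a sample drawn from the mixture
    alpha * target + (1 - alpha) * dual, re-weighted to the target policy.
    (Hypothetical helper; the mixture form is an assumption.)"""
    p_mix = alpha * p_target + (1.0 - alpha) * p_dual
    return p_target / p_mix


# Toy problem: one state, three actions; only action 2 violates the constraint.
pi_target = [0.70, 0.28, 0.02]  # assumed "safe" policy, rarely violates
pi_dual = [0.40, 0.20, 0.40]    # assumed violation-seeking policy
violates = [0.0, 0.0, 1.0]

alpha = 0.5
pi_mix = [alpha * t + (1.0 - alpha) * d for t, d in zip(pi_target, pi_dual)]

# Draw combined data from the mixture of both policies.
rng = random.Random(0)
n_samples = 10_000
actions = rng.choices(range(3), weights=pi_mix, k=n_samples)

# Fraction of violating samples in the combined data (much higher than 0.02).
violating_fraction = sum(violates[a] for a in actions) / n_samples

# Importance-weighted estimate of the violation rate under the target policy.
estimate = sum(
    mixture_is_weight(pi_target[a], pi_dual[a], alpha) * violates[a]
    for a in actions
) / n_samples

print(f"violating fraction in data: {violating_fraction:.3f}")
print(f"corrected estimate for target policy: {estimate:.4f}")
```

In this sketch the combined data contains far more violating samples than the target policy alone would produce, while the weighted average still estimates the target policy's own violation rate.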

Cite

Text

Yang et al. "Breaking Safety Paradox with Feasible Dual Policy Iteration." International Conference on Learning Representations, 2026.

Markdown

[Yang et al. "Breaking Safety Paradox with Feasible Dual Policy Iteration." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/yang2026iclr-breaking/)

BibTeX

@inproceedings{yang2026iclr-breaking,
  title     = {{Breaking Safety Paradox with Feasible Dual Policy Iteration}},
  author    = {Yang, Yujie and Teh, Jinglin and Lin, Ziyu and Yu, Kaicheng and Zhang, Tao and Li, Shengbo Eben},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/yang2026iclr-breaking/}
}