Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization

Abstract

Balancing helpfulness and safety (harmlessness) is a critical challenge in aligning large language models (LLMs). Current approaches often decouple these two objectives, training separate preference models for helpfulness and safety, while framing safety as a constraint within a constrained Markov Decision Process (CMDP) framework. This paper identifies a potential issue when using the widely adopted expected safety constraints for LLM safety alignment, termed "safety compensation", where the constraints are satisfied on expectation, but individual prompts may trade off safety, resulting in some responses being overly restrictive while others remain unsafe. To address this issue, we propose **Rectified Policy Optimization (RePO)**, which replaces the expected safety constraint with critical safety constraints imposed on every prompt. At the core of RePO is a policy update mechanism driven by rectified policy gradients, which penalizes the strict safety violation of every prompt, thereby enhancing safety across nearly all prompts. Our experiments demonstrate that RePO outperforms strong baseline methods and significantly enhances LLM safety alignment.
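The "safety compensation" issue described above can be illustrated with a minimal numerical sketch. The hinge-style rectification `max(0, c_i - b)` used here is an illustrative assumption for how a per-prompt penalty might be rectified, not the paper's exact formulation; the function names and cost values are hypothetical.

```python
# Sketch: expected (average) safety constraint vs. a per-prompt
# rectified penalty. All names and the hinge form max(0, c - b)
# are illustrative assumptions, not the paper's exact objective.

def expected_constraint_penalty(costs, budget, lam):
    """Penalty from an expected safety constraint:
    lam * max(0, mean(costs) - budget).
    A batch can satisfy this on average even while some
    individual prompts violate the budget ("safety compensation")."""
    mean_cost = sum(costs) / len(costs)
    return lam * max(0.0, mean_cost - budget)

def rectified_penalty(costs, budget, lam):
    """Per-prompt rectified penalty: every prompt whose safety cost
    exceeds the budget contributes lam * max(0, c - budget), so a
    violation cannot be offset by overly safe responses elsewhere."""
    return lam * sum(max(0.0, c - budget) for c in costs)

# Hypothetical per-prompt safety costs for a batch of four prompts.
costs = [0.2, 0.9, 0.1, 0.8]
budget = 0.5

print(expected_constraint_penalty(costs, budget, lam=1.0))  # 0.0 (mean is exactly 0.5)
print(rectified_penalty(costs, budget, lam=1.0))            # 0.7 (two prompts violate)
```

Under the expected constraint the batch incurs no penalty, even though two of the four prompts exceed the budget; the rectified penalty flags both violations, which is the behavior the per-prompt constraints in RePO are designed to enforce.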

Cite

Text

Peng et al. "Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization." Advances in Neural Information Processing Systems, 2025.

Markdown

[Peng et al. "Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/peng2025neurips-enhancing/)

BibTeX

@inproceedings{peng2025neurips-enhancing,
  title     = {{Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization}},
  author    = {Peng, Xiyue and Guo, Hengquan and Zhang, Jiawei and Zou, Dongqing and Shao, Ziyu and Wei, Honghao and Liu, Xin},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/peng2025neurips-enhancing/}
}