STAIR: Improving Safety Alignment with Introspective Reasoning

Abstract

Ensuring the safety and harmlessness of Large Language Models (LLMs) has become equally critical as their performance in applications. However, existing safety alignment methods typically suffer from safety-performance trade-offs and susceptibility to jailbreak attacks, primarily due to their reliance on direct refusals for malicious queries. In this paper, we propose STAIR, a novel framework that integrates SafeTy Alignment with Itrospective Reasoning. We enable LLMs to identify safety risks through step-by-step analysis by self-improving chain-of-thought (CoT) reasoning with safety awareness. STAIR first equips the model with a structured reasoning capability and then advances safety alignment via iterative preference optimization on step-level reasoning data generated using our newly proposed Safety-Informed Monte Carlo Tree Search (SI-MCTS). Specifically, we design a theoretically grounded reward for outcome evaluation to seek balance between helpfulness and safety. We further train a process reward model on this data to guide test-time searches for improved responses. Extensive experiments show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared to instinctive alignment strategies. With test-time scaling, STAIR achieves a safety performance comparable to Claude-3.5 against popular jailbreak attacks. We have open-sourced our code, datasets and models at https://github.com/thu-ml/STAIR.

Cite

Text

Zhang et al. "STAIR: Improving Safety Alignment with Introspective Reasoning." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Zhang et al. "STAIR: Improving Safety Alignment with Introspective Reasoning." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/zhang2025icml-stair/)

BibTeX

@inproceedings{zhang2025icml-stair,
  title     = {{STAIR: Improving Safety Alignment with Introspective Reasoning}},
  author    = {Zhang, Yichi and Zhang, Siyuan and Huang, Yao and Xia, Zeyu and Fang, Zhengwei and Yang, Xiao and Duan, Ranjie and Yan, Dong and Dong, Yinpeng and Zhu, Jun},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {76754-76777},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/zhang2025icml-stair/}
}