STAIR: Improving Safety Alignment with Introspective Reasoning

Yichi Zhang, Siyuan Zhang, Yao Huang, Zeyu Xia, Zhengwei Fang, Xiao Yang, Ranjie Duan, Dong Yan, Yinpeng Dong, Jun Zhu

ICML 2025 pp. 76754-76777

/icml/2025/zhang2025icml-stair/

Abstract

Ensuring the safety and harmlessness of Large Language Models (LLMs) has become equally critical as their performance in applications. However, existing safety alignment methods typically suffer from safety-performance trade-offs and susceptibility to jailbreak attacks, primarily due to their reliance on direct refusals for malicious queries. In this paper, we propose STAIR, a novel framework that integrates SafeTy Alignment with Itrospective Reasoning. We enable LLMs to identify safety risks through step-by-step analysis by self-improving chain-of-thought (CoT) reasoning with safety awareness. STAIR first equips the model with a structured reasoning capability and then advances safety alignment via iterative preference optimization on step-level reasoning data generated using our newly proposed Safety-Informed Monte Carlo Tree Search (SI-MCTS). Specifically, we design a theoretically grounded reward for outcome evaluation to seek balance between helpfulness and safety. We further train a process reward model on this data to guide test-time searches for improved responses. Extensive experiments show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared to instinctive alignment strategies. With test-time scaling, STAIR achieves a safety performance comparable to Claude-3.5 against popular jailbreak attacks. We have open-sourced our code, datasets and models at https://github.com/thu-ml/STAIR.

PDF ICML OpenReview Semantic Scholar

Cite

Text

Zhang et al. "STAIR: Improving Safety Alignment with Introspective Reasoning." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Zhang et al. "STAIR: Improving Safety Alignment with Introspective Reasoning." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/zhang2025icml-stair/)

BibTeX

@inproceedings{zhang2025icml-stair,
  title     = {{STAIR: Improving Safety Alignment with Introspective Reasoning}},
  author    = {Zhang, Yichi and Zhang, Siyuan and Huang, Yao and Xia, Zeyu and Fang, Zhengwei and Yang, Xiao and Duan, Ranjie and Yan, Dong and Dong, Yinpeng and Zhu, Jun},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {76754-76777},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/zhang2025icml-stair/}
}