GuardReasoner: Towards Reasoning-Based LLM Safeguards

Abstract

This paper proposes GuardReasoner, a new safeguard for LLMs that guides the guard model to learn to reason. To this end, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. We then introduce reasoning SFT to unlock the reasoning capability of guard models. Furthermore, we use the tuned models to mine hard samples and present hard sample DPO to further strengthen their reasoning ability. In this manner, GuardReasoner achieves better performance, explainability, and generalization ability. Extensive experiments and analyses on 13 guardrail benchmarks demonstrate the superiority of GuardReasoner. Remarkably, it surpasses GPT-4o+CoT by 5.65% and LLaMA Guard 3 8B by 21.02% in average F1 score. We release the training data, code, and GuardReasoner models at three scales (1B, 3B, 8B).
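
The abstract describes a two-stage recipe: reasoning SFT to unlock reasoning, then hard sample DPO on examples mined with the tuned models. Below is a minimal PyTorch sketch of that second stage, assuming the standard DPO objective and an ensemble-disagreement heuristic for mining hard samples; the function names and the classify method are illustrative placeholders, not the paper's released implementation.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Standard DPO objective: prefer the correct reasoning trace
    # (chosen) over the flawed one (rejected), regularized against
    # a frozen reference model via the log-ratio terms.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

def mine_hard_samples(samples, rsft_models):
    # Hypothetical mining step: keep samples on which an ensemble of
    # reasoning-SFT-tuned models disagrees, treating them as lying near
    # the decision boundary. `classify` is a placeholder, not a real API.
    hard = []
    for s in samples:
        labels = {m.classify(s) for m in rsft_models}
        if len(labels) > 1:  # disagreement => hard sample
            hard.append(s)
    return hard

# Toy per-sequence log-probabilities, just to show the call shape.
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-10.1]),
                torch.tensor([-12.0]), torch.tensor([-11.5]))
print(float(loss))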

Cite

Text

Liu et al. "GuardReasoner: Towards Reasoning-Based LLM Safeguards." ICLR 2025 Workshops: FM-Wild, 2025.

Markdown

[Liu et al. "GuardReasoner: Towards Reasoning-Based LLM Safeguards." ICLR 2025 Workshops: FM-Wild, 2025.](https://mlanthology.org/iclrw/2025/liu2025iclrw-guardreasoner/)

BibTeX

@inproceedings{liu2025iclrw-guardreasoner,
  title     = {{GuardReasoner: Towards Reasoning-Based LLM Safeguards}},
  author    = {Liu, Yue and Gao, Hongcheng and Zhai, Shengfang and Xia, Jun and Wu, Tianyi and Xue, Zhiwei and Chen, Yulin and Kawaguchi, Kenji and Zhang, Jiaheng and Hooi, Bryan},
  booktitle = {ICLR 2025 Workshops: FM-Wild},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/liu2025iclrw-guardreasoner/}
}