SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal

Abstract

Evaluating aligned large language models' (LLMs) ability to recognize and reject unsafe user requests is crucial for safe, policy-compliant deployments. Existing evaluation efforts, however, face three limitations that we address with **SORRY-Bench**, our proposed benchmark. **First**, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics. For example, among the ten existing datasets that we evaluated, tests for refusals of self-harm instructions are over 3x less represented than tests for fraudulent activities. SORRY-Bench improves on this by using a fine-grained taxonomy of 44 potentially unsafe topics and 440 class-balanced unsafe instructions, compiled through human-in-the-loop methods. **Second**, the linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked or only implicitly considered in existing evaluations. We supplement SORRY-Bench with 20 diverse linguistic augmentations to systematically examine these effects. **Third**, existing evaluations rely on large LLMs (e.g., GPT-4) as safety judges, which can be computationally expensive. We investigate design choices for creating a fast, accurate automated safety evaluator. By collecting 7K+ human annotations and conducting a meta-evaluation of diverse LLM-as-a-judge designs, we show that fine-tuned 7B LLMs can achieve accuracy comparable to GPT-4-scale LLMs, at lower computational cost. Putting these together, we evaluate over 50 proprietary and open-weight LLMs on SORRY-Bench and analyze their distinctive safety refusal behaviors. We hope our effort provides a building block for systematic evaluation of LLMs' safety refusal capabilities in a balanced, granular, and efficient manner. Benchmark demo, data, code, and models are available through [https://sorry-bench.github.io](https://sorry-bench.github.io).
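To make the evaluation protocol concrete, below is a minimal sketch (not the authors' official pipeline) of how per-category refusal scoring on a class-balanced benchmark like SORRY-Bench might look. The Hugging Face dataset ID, column names, and the two placeholder functions are assumptions for illustration; consult [https://sorry-bench.github.io](https://sorry-bench.github.io) for the released data and the fine-tuned judge model.

```python
# Minimal sketch of class-balanced safety-refusal evaluation.
# Dataset ID and field names below are assumptions, not the official schema.
from collections import defaultdict
from datasets import load_dataset  # pip install datasets


def generate(prompt: str) -> str:
    """Placeholder: the model under test (API call or local inference)."""
    raise NotImplementedError


def judge_fulfillment(prompt: str, response: str) -> bool:
    """Placeholder: the safety judge (e.g., a fine-tuned 7B LLM).
    Returns True if the response fulfills the unsafe request."""
    raise NotImplementedError


# Assumed dataset ID and split -- verify against the official release.
bench = load_dataset("sorry-bench/sorry-bench-202406", split="train")

fulfilled = defaultdict(int)
total = defaultdict(int)
for row in bench:
    category = row["category"]        # one of the 44 unsafe topics (assumed field)
    prompt = row["turns"][0]          # unsafe instruction text (assumed field)
    response = generate(prompt)
    fulfilled[category] += int(judge_fulfillment(prompt, response))
    total[category] += 1

# Class-balanced aggregate: average per-category fulfillment rates so that
# each topic contributes equally, regardless of how many prompts it has.
per_category = {c: fulfilled[c] / total[c] for c in total}
balanced_rate = sum(per_category.values()) / len(per_category)
print(sorted(per_category.items(), key=lambda kv: kv[1], reverse=True))
print("balanced fulfillment rate:", balanced_rate)
```

Averaging per-category rates, rather than pooling all prompts, is one way to keep the score from being dominated by whichever topics happen to have more test cases, which mirrors the benchmark's emphasis on class balance across its 44 topics.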

Cite

Text

Xie et al. "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal." International Conference on Learning Representations, 2025.

Markdown

[Xie et al. "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/xie2025iclr-sorrybench/)

BibTeX

@inproceedings{xie2025iclr-sorrybench,
  title     = {{SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal}},
  author    = {Xie, Tinghao and Qi, Xiangyu and Zeng, Yi and Huang, Yangsibo and Sehwag, Udari Madhushani and Huang, Kaixuan and He, Luxi and Wei, Boyi and Li, Dacheng and Sheng, Ying and Jia, Ruoxi and Li, Bo and Li, Kai and Chen, Danqi and Henderson, Peter and Mittal, Prateek},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/xie2025iclr-sorrybench/}
}