SafeDialBench: A Fine-Grained Safety Evaluation Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks

Cao, Hongye; Jing, Sijia; Wang, Yanming; Peng, Ziyue; Bai, Zhixin; Cao, Zhe; Fang, Meng; Feng, Fan; Liu, Jiaheng; Wang, Boyan; Yang, Tianpei; Huo, Jing; Gao, Yang; Meng, Fanyu; Yang, Xi; Deng, Chao; Feng, Junlan

ICLR 2026

Abstract

With the rapid advancement of Large Language Models (LLMs), their safety has become a critical concern that requires precise assessment. Current benchmarks primarily concentrate on single-turn dialogues or a single jailbreak attack method. Moreover, they do not assess in detail an LLM's capability to identify and handle unsafe information. To address these issues, we propose SafeDialBench, a fine-grained benchmark for evaluating the safety of LLMs across various jailbreak attacks in multi-turn dialogues. Specifically, we design a two-tier hierarchical safety taxonomy that covers 6 safety dimensions and generate more than 4000 multi-turn dialogues in both Chinese and English under 22 dialogue scenarios. We employ 7 jailbreak attack strategies, such as reference attack and purpose reverse, to enhance the dataset quality for dialogue generation. Notably, we construct an innovative automatic assessment framework for LLMs that measures their capabilities in detecting and handling unsafe information and in maintaining consistency when facing jailbreak attacks. Experimental results across 19 LLMs reveal that Yi-34B-Chat and GLM4-9B-Chat demonstrate superior safety performance, while Llama3.1-8B-Instruct and o3-mini exhibit safety vulnerabilities.

Cite

Text

Cao et al. "SafeDialBench: A Fine-Grained Safety Evaluation Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks." International Conference on Learning Representations, 2026.

Markdown

[Cao et al. "SafeDialBench: A Fine-Grained Safety Evaluation Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/cao2026iclr-safedialbench/)

BibTeX

@inproceedings{cao2026iclr-safedialbench,
  title     = {{SafeDialBench: A Fine-Grained Safety Evaluation Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks}},
  author    = {Cao, Hongye and Jing, Sijia and Wang, Yanming and Peng, Ziyue and Bai, Zhixin and Cao, Zhe and Fang, Meng and Feng, Fan and Liu, Jiaheng and Wang, Boyan and Yang, Tianpei and Huo, Jing and Gao, Yang and Meng, Fanyu and Yang, Xi and Deng, Chao and Feng, Junlan},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/cao2026iclr-safedialbench/}
}