GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-Wild LLM Jailbreak Methods

Huang, Ruixuan; Wang, Xunguang; Li, Zongjie; Wu, Daoyuan; Wang, Shuai

ICLR 2026

Abstract

Despite the growing interest in jailbreaks as an effective red-teaming tool for building safe and responsible large language models (LLMs), flawed evaluation system designs have led to significant discrepancies in assessments of jailbreak effectiveness. Through a systematic measurement study of 37 jailbreak studies published since 2022, we find that existing evaluation systems lack case-specific criteria, resulting in misleading conclusions about the effectiveness and safety implications of jailbreak methods. In this paper, we introduce GuidedBench, a novel benchmark comprising a curated harmful question dataset and GuidedEval, an evaluation system integrated with detailed case-by-case evaluation guidelines. Experiments demonstrate that GuidedBench offers more accurate evaluations of jailbreak performance, enabling meaningful comparisons across methods. GuidedEval reduces inter-evaluator variance by at least 76.03%, supporting reliable and reproducible evaluations. We reveal why existing jailbreak benchmarks fail to evaluate accurately and suggest better evaluation practices.

Cite

Text

Huang et al. "GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-Wild LLM Jailbreak Methods." International Conference on Learning Representations, 2026.

Markdown

[Huang et al. "GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-Wild LLM Jailbreak Methods." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/huang2026iclr-guidedbench/)

BibTeX

@inproceedings{huang2026iclr-guidedbench,
  title     = {{GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-Wild LLM Jailbreak Methods}},
  author    = {Huang, Ruixuan and Wang, Xunguang and Li, Zongjie and Wu, Daoyuan and Wang, Shuai},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/huang2026iclr-guidedbench/}
}