GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-Wild LLM Jailbreak Methods

Abstract

Despite the growing interest in jailbreaks as an effective red-teaming tool for building safe and responsible large language models (LLMs), flawed evaluation system designs have led to significant discrepancies in assessments of jailbreak effectiveness. Through a systematic measurement study based on 37 jailbreak studies since 2022, we find that existing evaluation systems lack case-specific criteria, which leads to misleading conclusions about the effectiveness and safety implications of jailbreak methods. In this paper, we introduce GuidedBench, a novel benchmark comprising a curated harmful question dataset and GuidedEval, an evaluation system integrated with detailed case-by-case evaluation guidelines. Experiments demonstrate that GuidedBench offers more accurate evaluations of jailbreak performance, enabling meaningful comparisons across methods. GuidedEval reduces inter-evaluator variance by at least 76.03%, ensuring reliable and reproducible evaluations. We reveal why existing jailbreak benchmarks fail to evaluate accurately and suggest better evaluation practices.

Cite

Text

Huang et al. "GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-Wild LLM Jailbreak Methods." International Conference on Learning Representations, 2026.

Markdown

[Huang et al. "GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-Wild LLM Jailbreak Methods." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/huang2026iclr-guidedbench/)

BibTeX

@inproceedings{huang2026iclr-guidedbench,
  title     = {{GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-Wild LLM Jailbreak Methods}},
  author    = {Huang, Ruixuan and Wang, Xunguang and Li, Zongjie and Wu, Daoyuan and Wang, Shuai},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/huang2026iclr-guidedbench/}
}