GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-Wild LLM Jailbreak Methods
Abstract
Despite growing interest in jailbreaks as a red-teaming tool for building safe and responsible large language models (LLMs), flawed evaluation system designs have led to significant discrepancies in assessments of jailbreak effectiveness. Through a systematic measurement study of 37 jailbreak studies since 2022, we find that existing evaluation systems lack case-specific criteria, which leads to misleading conclusions about the effectiveness and safety implications of jailbreak methods. In this paper, we introduce GuidedBench, a novel benchmark comprising a curated harmful question dataset and GuidedEval, an evaluation system integrated with detailed case-by-case evaluation guidelines. Experiments demonstrate that GuidedBench offers more accurate evaluations of jailbreak performance, enabling meaningful comparisons across methods. GuidedEval reduces inter-evaluator variance by at least 76.03%, making evaluations more reliable and reproducible. We also show why existing jailbreak benchmarks fail to evaluate accurately and suggest better evaluation practices.
Cite
Text
Huang et al. "GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-Wild LLM Jailbreak Methods." International Conference on Learning Representations, 2026.
Markdown
[Huang et al. "GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-Wild LLM Jailbreak Methods." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/huang2026iclr-guidedbench/)
BibTeX
@inproceedings{huang2026iclr-guidedbench,
title = {{GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-Wild LLM Jailbreak Methods}},
author = {Huang, Ruixuan and Wang, Xunguang and Li, Zongjie and Wu, Daoyuan and Wang, Shuai},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/huang2026iclr-guidedbench/}
}