Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
Abstract
Although Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner, they remain susceptible to jailbreak attacks and can be manipulated into producing harmful outputs. Recently, a growing body of research has categorized jailbreak attacks into token-level and prompt-level attacks. However, previous work has largely overlooked the diverse key factors of jailbreak attacks, with most studies concentrating on LLM vulnerabilities and neglecting defense-enhanced LLMs. To address these issues, we introduce JailTrickBench to evaluate the impact of various attack settings on LLM performance and to provide a baseline for jailbreak attacks, encouraging the adoption of a standardized evaluation framework. Specifically, we evaluate eight key factors of implementing jailbreak attacks on LLMs from both target-level and attack-level perspectives. We further conduct seven representative jailbreak attacks against six defense methods across two widely used datasets, encompassing approximately 354 experiments and about 55,000 GPU hours on A800-80G. Our experimental results highlight the need for standardized benchmarking to evaluate these attacks on defense-enhanced LLMs. Our code is available at https://github.com/usail-hkust/JailTrickBench.
Cite
Text
Xu et al. "Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs." Neural Information Processing Systems, 2024. doi:10.52202/079017-1012
Markdown
[Xu et al. "Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/xu2024neurips-bag/) doi:10.52202/079017-1012
BibTeX
@inproceedings{xu2024neurips-bag,
title = {{Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs}},
author = {Xu, Zhao and Liu, Fan and Liu, Hao},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-1012},
url = {https://mlanthology.org/neurips/2024/xu2024neurips-bag/}
}