Jailbreak Defense in a Narrow Domain: Failures of Existing Methods and Improving Transcript-Based Classifiers

Tony Tong Wang, John Hughes, Henry Sleight, Rylan Schaeffer, Rajashree Agrawal, Fazl Barez, Mrinank Sharma, Jesse Mu, Nir N Shavit, Ethan Perez

NeurIPSW 2024

/neuripsw/2024/wang2024neuripsw-jailbreak/

Abstract

Defending large language models against jailbreaks so that they never engage in a broad set of forbidden behaviors is an open problem. In this paper, we study if jailbreak-defense is more tractable if one only needs to forbid a very narrow set of behaviors. As a case study, we focus on preventing an LLM from helping a user make a bomb. We find that popular defenses such as safety training, adversarial training, and input/output classifiers are inadequate. In pursuit of a better defense, we develop our own classifier defense tailored to our bomb setting, which outperforms existing defenses on some axes but is still ultimately broken. We conclude that jailbreak-defense is unsolved, even in a narrow domain.

PDF NeurIPSW OpenReview Semantic Scholar

Cite

Text

Wang et al. "Jailbreak Defense in a Narrow Domain: Failures of Existing Methods and Improving Transcript-Based Classifiers." NeurIPS 2024 Workshops: AdvML-Frontiers, 2024.

Markdown

[Wang et al. "Jailbreak Defense in a Narrow Domain: Failures of Existing Methods and Improving Transcript-Based Classifiers." NeurIPS 2024 Workshops: AdvML-Frontiers, 2024.](https://mlanthology.org/neuripsw/2024/wang2024neuripsw-jailbreak/)

BibTeX

@inproceedings{wang2024neuripsw-jailbreak,
  title     = {{Jailbreak Defense in a Narrow Domain: Failures of Existing Methods and Improving Transcript-Based Classifiers}},
  author    = {Wang, Tony Tong and Hughes, John and Sleight, Henry and Schaeffer, Rylan and Agrawal, Rajashree and Barez, Fazl and Sharma, Mrinank and Mu, Jesse and Shavit, Nir N and Perez, Ethan},
  booktitle = {NeurIPS 2024 Workshops: AdvML-Frontiers},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/wang2024neuripsw-jailbreak/}
}