Jailbreak Defense in a Narrow Domain: Failures of Existing Methods and Improving Transcript-Based Classifiers
Abstract
Defending large language models against jailbreaks so that they never engage in a broad set of forbidden behaviors is an open problem. In this paper, we study if jailbreak-defense is more tractable if one only needs to forbid a very narrow set of behaviors. As a case study, we focus on preventing an LLM from helping a user make a bomb. We find that popular defenses such as safety training, adversarial training, and input/output classifiers are inadequate. In pursuit of a better defense, we develop our own classifier defense tailored to our bomb setting, which outperforms existing defenses on some axes but is still ultimately broken. We conclude that jailbreak-defense is unsolved, even in a narrow domain.
Cite
Text
Wang et al. "Jailbreak Defense in a Narrow Domain: Failures of Existing Methods and Improving Transcript-Based Classifiers." NeurIPS 2024 Workshops: AdvML-Frontiers, 2024.Markdown
[Wang et al. "Jailbreak Defense in a Narrow Domain: Failures of Existing Methods and Improving Transcript-Based Classifiers." NeurIPS 2024 Workshops: AdvML-Frontiers, 2024.](https://mlanthology.org/neuripsw/2024/wang2024neuripsw-jailbreak/)BibTeX
@inproceedings{wang2024neuripsw-jailbreak,
title = {{Jailbreak Defense in a Narrow Domain: Failures of Existing Methods and Improving Transcript-Based Classifiers}},
author = {Wang, Tony Tong and Hughes, John and Sleight, Henry and Schaeffer, Rylan and Agrawal, Rajashree and Barez, Fazl and Sharma, Mrinank and Mu, Jesse and Shavit, Nir N and Perez, Ethan},
booktitle = {NeurIPS 2024 Workshops: AdvML-Frontiers},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/wang2024neuripsw-jailbreak/}
}