Plentiful Jailbreaks with String Compositions
Abstract
Large language models (LLMs) remain vulnerable to a slew of adversarial attacks and jailbreaking methods. One common approach employed by white-hat attackers, or red-teamers, is to process model inputs and outputs using string-level obfuscations, which can include leetspeak, rotary ciphers, Base64, ASCII, and more. Our work extends these encoding-based attacks by unifying them in a framework of invertible string transformations. With invertibility, we can devise arbitrary string compositions, defined as sequences of transformations, that we can encode and decode end-to-end programmatically. We devise an automated best-of-n attack that samples from a combinatorially large number of string compositions. Our jailbreaks obtain competitive attack success rates on several leading frontier models when evaluated on HarmBench, highlighting that encoding-based attacks remain a persistent vulnerability even in advanced LLMs.
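To make the notion of an invertible string composition concrete, the following is a minimal sketch and not the paper's implementation. It assumes three illustrative transformations (ROT13, Base64, and string reversal); the names `Transform`, `compose_encode`, and `compose_decode` are hypothetical and chosen only for this example. The point it shows is that when every transformation has an inverse, a sequence of them can be undone by applying the inverses in reverse order.

```python
import base64
import codecs
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Transform:
    """A string transformation paired with its inverse (hypothetical helper)."""
    name: str
    encode: Callable[[str], str]
    decode: Callable[[str], str]


# Illustrative transformations; the paper's actual set is not reproduced here.
ROT13 = Transform(
    "rot13",
    lambda s: codecs.encode(s, "rot_13"),
    lambda s: codecs.encode(s, "rot_13"),  # ROT13 is its own inverse
)
BASE64 = Transform(
    "base64",
    lambda s: base64.b64encode(s.encode("utf-8")).decode("ascii"),
    lambda s: base64.b64decode(s.encode("ascii")).decode("utf-8"),
)
REVERSE = Transform("reverse", lambda s: s[::-1], lambda s: s[::-1])


def compose_encode(text: str, transforms: Sequence[Transform]) -> str:
    """Apply each transformation in order."""
    for t in transforms:
        text = t.encode(text)
    return text


def compose_decode(text: str, transforms: Sequence[Transform]) -> str:
    """Undo a composition by applying the inverses in reverse order."""
    for t in reversed(transforms):
        text = t.decode(text)
    return text


if __name__ == "__main__":
    composition = [ROT13, BASE64, REVERSE]
    message = "hello world"
    encoded = compose_encode(message, composition)
    # The round trip recovers the original string.
    assert compose_decode(encoded, composition) == message
```

Because each element carries its own inverse, any ordering or length of composition stays decodable end to end, which is what makes sampling from a large space of compositions practical to automate.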
Cite
Text
Huang. "Plentiful Jailbreaks with String Compositions." NeurIPS 2024 Workshops: SoLaR, 2024.
Markdown
[Huang. "Plentiful Jailbreaks with String Compositions." NeurIPS 2024 Workshops: SoLaR, 2024.](https://mlanthology.org/neuripsw/2024/huang2024neuripsw-plentiful-a/)
BibTeX
@inproceedings{huang2024neuripsw-plentiful-a,
title = {{Plentiful Jailbreaks with String Compositions}},
author = {Huang, Brian R.Y.},
booktitle = {NeurIPS 2024 Workshops: SoLaR},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/huang2024neuripsw-plentiful-a/}
}