Plentiful Jailbreaks with String Compositions

Abstract

Large language models (LLMs) remain vulnerable to a slew of adversarial attacks and jailbreaking methods. One common approach employed by white-hat attackers, or red-teamers, is to process model inputs and outputs using string-level obfuscations, which can include leetspeak, rotary ciphers, Base64, ASCII, and more. Our work extends these encoding-based attacks by unifying them in a framework of invertible string transformations. With invertibility, we can devise arbitrary string compositions, defined as sequences of transformations, that we can encode and decode end-to-end programmatically. We devise an automated best-of-n attack that samples from a combinatorially large number of string compositions. Our jailbreaks obtain competitive attack success rates on several leading frontier models when evaluated on HarmBench, highlighting that encoding-based attacks remain a persistent vulnerability even in advanced LLMs.
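The idea of composing invertible string transformations can be illustrated with a minimal sketch. This is not the paper's implementation: the `Transform` class, the `compose` and `count_compositions` helpers, and the choice of three example transformations (ROT13, string reversal, Base64) are all assumptions made for illustration. The sketch only shows that a sequence of invertible encoders can be undone by applying the decoders in reverse order, and how quickly the number of possible sequences grows.

```python
import base64
import codecs
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Transform:
    """Hypothetical container for an invertible string transformation."""
    name: str
    encode: Callable[[str], str]
    decode: Callable[[str], str]


# Three illustrative transformations; each decode exactly inverts its encode.
ROT13 = Transform(
    "rot13",
    lambda s: codecs.encode(s, "rot13"),
    lambda s: codecs.decode(s, "rot13"),
)
REVERSE = Transform("reverse", lambda s: s[::-1], lambda s: s[::-1])
BASE64 = Transform(
    "base64",
    lambda s: base64.b64encode(s.encode("utf-8")).decode("ascii"),
    lambda s: base64.b64decode(s.encode("ascii")).decode("utf-8"),
)


def compose(transforms: Sequence[Transform]) -> Transform:
    """Chain transformations; decoding applies the inverses in reverse order."""
    def encode(s: str) -> str:
        for t in transforms:
            s = t.encode(s)
        return s

    def decode(s: str) -> str:
        for t in reversed(transforms):
            s = t.decode(s)
        return s

    return Transform(" -> ".join(t.name for t in transforms), encode, decode)


def count_compositions(num_transforms: int, max_length: int) -> int:
    """Count ordered sequences of length 1..max_length, repetition allowed."""
    return sum(num_transforms ** k for k in range(1, max_length + 1))


pipeline = compose([ROT13, REVERSE, BASE64])
text = "hello world"
encoded = pipeline.encode(text)
round_trip = pipeline.decode(encoded)  # recovers the original text
```

Because every step is invertible, the round trip holds for any ordering of the steps. Under the assumption that repetition is allowed, `count_compositions(3, 3)` gives 3 + 9 + 27 = 39 sequences from only three transformations, which suggests why the space grows combinatorially with a larger library.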

Cite

Text

Huang. "Plentiful Jailbreaks with String Compositions." NeurIPS 2024 Workshops: SoLaR, 2024.

Markdown

[Huang. "Plentiful Jailbreaks with String Compositions." NeurIPS 2024 Workshops: SoLaR, 2024.](https://mlanthology.org/neuripsw/2024/huang2024neuripsw-plentiful-a/)

BibTeX

@inproceedings{huang2024neuripsw-plentiful-a,
  title     = {{Plentiful Jailbreaks with String Compositions}},
  author    = {Huang, Brian R.Y.},
  booktitle = {NeurIPS 2024 Workshops: SoLaR},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/huang2024neuripsw-plentiful-a/}
}