Open Sesame! Universal Black-Box Jailbreaking of Large Language Models

Abstract

We introduce a novel approach that employs a genetic algorithm (GA) to manipulate large language models (LLMs) when the model's architecture and parameters are inaccessible. The GA attack works by optimizing a universal adversarial prompt that, when combined with a user's query, disrupts the attacked model's alignment, resulting in unintended and potentially harmful outputs. To our knowledge, this is the first automated universal black-box jailbreak attack.
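The attack described in the abstract follows a standard genetic-algorithm loop: maintain a population of candidate adversarial suffixes, score each one by querying the target model purely as a black box, and breed the highest-scoring candidates via crossover and mutation. The sketch below illustrates only that generic loop; the vocabulary, the `fitness` function, and all hyperparameters (`SUFFIX_LEN`, `POP_SIZE`, and so on) are placeholder assumptions for illustration, not the paper's actual settings. In particular, `fitness()` stands in for a real black-box query that would score how far the model's response to (user prompt + suffix) strays from an aligned refusal.

```python
import random

# Hypothetical token vocabulary the attacker samples suffix tokens from
# (placeholder; the paper's actual token space is not reproduced here).
VOCAB = [f"tok{i}" for i in range(1000)]

SUFFIX_LEN = 20      # length of the adversarial suffix (assumed)
POP_SIZE = 50        # population size (assumed)
GENERATIONS = 100    # number of GA iterations (assumed)
MUTATION_RATE = 0.1  # per-token mutation probability (assumed)
ELITE = 5            # top individuals copied unchanged each generation


def fitness(suffix):
    """Black-box score of a candidate suffix.

    Placeholder: a real implementation would send (user prompt + suffix)
    to the target LLM and score the response, with no access to model
    weights or gradients. Here we return a random stand-in score.
    """
    return random.random()


def crossover(a, b):
    """One-point crossover between two parent suffixes."""
    point = random.randrange(1, SUFFIX_LEN)
    return a[:point] + b[point:]


def mutate(suffix):
    """Randomly replace tokens with fresh samples from the vocabulary."""
    return [random.choice(VOCAB) if random.random() < MUTATION_RATE else t
            for t in suffix]


def evolve():
    """Run the GA loop and return the best suffix found."""
    population = [[random.choice(VOCAB) for _ in range(SUFFIX_LEN)]
                  for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        # Rank candidates by their black-box score.
        ranked = sorted(population, key=fitness, reverse=True)
        elites = ranked[:ELITE]
        # Breed children from the top half of the population.
        children = []
        while len(children) < POP_SIZE - ELITE:
            p1, p2 = random.sample(ranked[:POP_SIZE // 2], 2)
            children.append(mutate(crossover(p1, p2)))
        population = elites + children
    return max(population, key=fitness)


if __name__ == "__main__":
    best_suffix = evolve()
    print(" ".join(best_suffix))
```

Because the loop only ever reads scalar scores from `fitness`, it needs no gradients or internals of the attacked model, which is what makes the approach black-box.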

Cite

Text

Lapid et al. "Open Sesame! Universal Black-Box Jailbreaking of Large Language Models." ICLR 2024 Workshops: SeT_LLM, 2024.

Markdown

[Lapid et al. "Open Sesame! Universal Black-Box Jailbreaking of Large Language Models." ICLR 2024 Workshops: SeT_LLM, 2024.](https://mlanthology.org/iclrw/2024/lapid2024iclrw-open/)

BibTeX

@inproceedings{lapid2024iclrw-open,
  title     = {{Open Sesame! Universal Black-Box Jailbreaking of Large Language Models}},
  author    = {Lapid, Raz and Langberg, Ron and Sipper, Moshe},
  booktitle = {ICLR 2024 Workshops: SeT_LLM},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/lapid2024iclrw-open/}
}