Many-Shot Jailbreaking

Abstract

We investigate a family of simple long-context attacks on large language models: prompting with hundreds of demonstrations of undesirable behavior. This attack is newly feasible with the larger context windows recently deployed by language model providers like Google DeepMind, OpenAI and Anthropic. We find that in diverse, realistic circumstances, the effectiveness of this attack follows a power law, up to hundreds of shots. We demonstrate the success of this attack on the most widely used state-of-the-art closed-weight models, and across various tasks. Our results suggest very long contexts present a rich new attack surface for LLMs.
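The power-law trend described in the abstract can be checked with a simple log-log fit of an effectiveness metric against the number of in-context demonstrations. The sketch below is illustrative only: the choice of metric (e.g., the negative log-likelihood of the undesired response) and the shot counts are assumptions standing in for whatever measurements one collects, not values from the paper.

import numpy as np

def fit_power_law(num_shots, metric):
    # Fit metric ~ C * num_shots ** (-alpha) by linear regression in log-log space.
    log_n = np.log(np.asarray(num_shots, dtype=float))
    log_m = np.log(np.asarray(metric, dtype=float))
    slope, intercept = np.polyfit(log_n, log_m, 1)
    alpha, C = -slope, np.exp(intercept)
    return alpha, C

# Hypothetical usage: shot counts and per-count measurements supplied by the experimenter.
# alpha, C = fit_power_law([4, 16, 64, 256], measured_nll)

A roughly constant alpha across tasks would correspond to the power-law behavior the abstract reports, though the paper's own methodology should be consulted for the exact metric and fitting procedure.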

Cite

Text

Anil et al. "Many-Shot Jailbreaking." Neural Information Processing Systems, 2024. doi:10.52202/079017-4121

Markdown

[Anil et al. "Many-Shot Jailbreaking." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/anil2024neurips-manyshot/) doi:10.52202/079017-4121

BibTeX

@inproceedings{anil2024neurips-manyshot,
  title     = {{Many-Shot Jailbreaking}},
  author    = {Anil, Cem and Durmus, Esin and Panickssery, Nina and Sharma, Mrinank and Benton, Joe and Kundu, Sandipan and Batson, Joshua and Tong, Meg and Mu, Jesse and Ford, Daniel and Mosconi, Francesco and Agrawal, Rajashree and Schaeffer, Rylan and Bashkansky, Naomi and Svenningsen, Samuel and Lambert, Mike and Radhakrishnan, Ansh and Denison, Carson and Hubinger, Evan J and Bai, Yuntao and Bricken, Trenton and Maxwell, Timothy and Schiefer, Nicholas and Sully, James and Tamkin, Alex and Lanham, Tamera and Nguyen, Karina and Korbak, Tomasz and Kaplan, Jared and Ganguli, Deep and Bowman, Samuel R. and Perez, Ethan and Grosse, Roger Baker and Duvenaud, David},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-4121},
  url       = {https://mlanthology.org/neurips/2024/anil2024neurips-manyshot/}
}