Coercing LLMs to Do and Reveal (almost) Anything
Abstract
It has recently been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into outputting harmful text. In this work, we argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking and provide a broad overview of possible attack surfaces and attack goals. Based on a series of concrete examples, we discuss attacks that coerce varied unintended behaviors, such as misdirection, model control, denial-of-service, or data extraction. We then analyze the mechanism by which these attacks function, highlighting the use of glitch tokens, and the propensity of attacks to control the model by coercing it to simulate code.
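To make the notion of an "adversarial attack on an LLM" concrete, the sketch below shows one simple optimization-based attack of the kind the paper studies: an adversarial suffix appended to a benign prompt is optimized by random token substitution so that the model assigns high likelihood to an attacker-chosen target continuation. This is a minimal illustration, not the authors' implementation; the model name, prompt, target string, suffix length, and iteration budget are all illustrative assumptions.

```python
# Minimal sketch of an optimization-based adversarial attack on a causal LM:
# random token substitutions in an adversarial suffix are kept whenever they
# lower the cross-entropy of a fixed target continuation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM suffices for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Tell me about the weather."
target = "I hate humans."                                # behavior the attacker wants to coerce
suffix_ids = torch.randint(0, tok.vocab_size, (1, 10))   # adversarial suffix tokens

prompt_ids = tok(prompt, return_tensors="pt").input_ids
target_ids = tok(target, return_tensors="pt").input_ids

@torch.no_grad()
def target_loss(suffix: torch.Tensor) -> float:
    """Cross-entropy of the target continuation given prompt + adversarial suffix."""
    ids = torch.cat([prompt_ids, suffix, target_ids], dim=1)
    logits = model(ids).logits
    start = prompt_ids.shape[1] + suffix.shape[1]
    # logits at position i predict token i+1, so shift back by one
    pred = logits[:, start - 1 : start - 1 + target_ids.shape[1], :]
    return torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.shape[-1]), target_ids.reshape(-1)
    ).item()

best = target_loss(suffix_ids)
for step in range(500):                                   # illustrative budget
    cand = suffix_ids.clone()
    pos = torch.randint(0, cand.shape[1], (1,)).item()
    cand[0, pos] = torch.randint(0, tok.vocab_size, (1,)).item()
    loss = target_loss(cand)
    if loss < best:                                       # keep substitutions that help
        best, suffix_ids = loss, cand

print("adversarial suffix:", tok.decode(suffix_ids[0]))
```

The paper's actual attacks use stronger gradient-guided token search, but the objective is the same: find input tokens that drive the model toward an attacker-specified output, whether that output is harmful text, a denial-of-service response, or leaked data.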
Cite
Text
Geiping et al. "Coercing LLMs to Do and Reveal (almost) Anything." ICLR 2024 Workshops: SeT_LLM, 2024.

Markdown

[Geiping et al. "Coercing LLMs to Do and Reveal (almost) Anything." ICLR 2024 Workshops: SeT_LLM, 2024.](https://mlanthology.org/iclrw/2024/geiping2024iclrw-coercing/)

BibTeX
@inproceedings{geiping2024iclrw-coercing,
title = {{Coercing LLMs to Do and Reveal (almost) Anything}},
author = {Geiping, Jonas and Stein, Alex and Shu, Manli and Saifullah, Khalid and Wen, Yuxin and Goldstein, Tom},
booktitle = {ICLR 2024 Workshops: SeT_LLM},
year = {2024},
url = {https://mlanthology.org/iclrw/2024/geiping2024iclrw-coercing/}
}