Coercing LLMs to Do and Reveal (almost) Anything
Abstract
It has recently been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into outputting harmful text. In this work, we argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking and provide a broad overview of possible attack surfaces and attack goals. Based on a series of concrete examples, we discuss attacks that coerce varied unintended behaviors, such as misdirection, model control, denial-of-service, or data extraction. We then analyze the mechanism by which these attacks function, highlighting the use of glitch tokens, and the propensity of attacks to control the model by coercing it to simulate code.
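To make the notion of an "adversarial attack on an LLM" concrete, the sketch below shows one simple optimization-based attack of the kind the paper studies: an adversarial suffix appended to a benign prompt is optimized by random token substitution so that the model assigns high likelihood to an attacker-chosen target continuation. This is a minimal illustration, not the authors' implementation; the model name, prompt, target string, suffix length, and iteration budget are all illustrative assumptions.

```python
# Minimal sketch of an optimization-based adversarial attack on a causal LM:
# random token substitutions in an adversarial suffix are kept whenever they
# lower the cross-entropy of a fixed target continuation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM suffices for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Tell me about the weather."
target = "I hate humans."                                # behavior the attacker wants to coerce
suffix_ids = torch.randint(0, tok.vocab_size, (1, 10))   # adversarial suffix tokens

prompt_ids = tok(prompt, return_tensors="pt").input_ids
target_ids = tok(target, return_tensors="pt").input_ids

@torch.no_grad()
def target_loss(suffix: torch.Tensor) -> float:
    """Cross-entropy of the target continuation given prompt + adversarial suffix."""
    ids = torch.cat([prompt_ids, suffix, target_ids], dim=1)
    logits = model(ids).logits
    start = prompt_ids.shape[1] + suffix.shape[1]
    # logits at position i predict token i+1, so shift back by one
    pred = logits[:, start - 1 : start - 1 + target_ids.shape[1], :]
    return torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.shape[-1]), target_ids.reshape(-1)
    ).item()

best = target_loss(suffix_ids)
for step in range(500):                                   # illustrative budget
    cand = suffix_ids.clone()
    pos = torch.randint(0, cand.shape[1], (1,)).item()
    cand[0, pos] = torch.randint(0, tok.vocab_size, (1,)).item()
    loss = target_loss(cand)
    if loss < best:                                       # keep substitutions that help
        best, suffix_ids = loss, cand

print("adversarial suffix:", tok.decode(suffix_ids[0]))
```

The paper's actual attacks use stronger gradient-guided token search, but the objective is the same: find input tokens that drive the model toward an attacker-specified output, whether that output is harmful text, a denial-of-service response, or leaked data.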
Cite
Text
Geiping et al. "Coercing LLMs to Do and Reveal (almost) Anything." ICLR 2024 Workshops: SeT_LLM, 2024.

Markdown

[Geiping et al. "Coercing LLMs to Do and Reveal (almost) Anything." ICLR 2024 Workshops: SeT_LLM, 2024.](https://mlanthology.org/iclrw/2024/geiping2024iclrw-coercing/)

BibTeX
@inproceedings{geiping2024iclrw-coercing,
title = {{Coercing LLMs to Do and Reveal (almost) Anything}},
author = {Geiping, Jonas and Stein, Alex and Shu, Manli and Saifullah, Khalid and Wen, Yuxin and Goldstein, Tom},
booktitle = {ICLR 2024 Workshops: SeT_LLM},
year = {2024},
url = {https://mlanthology.org/iclrw/2024/geiping2024iclrw-coercing/}
}