AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models

Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, Tong Sun

NeurIPSW 2023

Abstract

Large Language Models (LLMs) exhibit broad utility in diverse applications but remain vulnerable to jailbreak attacks, including hand-crafted and automated adversarial attacks, which can compromise their safety measures. However, recent work suggests that patching LLMs against these attacks is possible: manual jailbreak attacks are human-readable but often limited and public, making them easy to block, while automated adversarial attacks generate gibberish prompts that can be detected using perplexity-based filters. In this paper, we propose an interpretable adversarial attack, AutoDAN, that combines the strengths of both types of attacks. It automatically generates attack prompts that bypass perplexity-based filters while maintaining a high attack success rate like manual jailbreak attacks. These prompts are interpretable, exhibiting strategies commonly used in manual jailbreak attacks. Moreover, these interpretable prompts transfer better than their non-readable counterparts, especially when using limited data or a single proxy model. Beyond eliciting harmful content, we also customize the objective of AutoDAN to leak system prompts, demonstrating its versatility. Our work underscores the seemingly intrinsic vulnerability of LLMs to interpretable adversarial attacks.

Cite

Text

Zhu et al. "AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models." NeurIPS 2023 Workshops: SoLaR, 2023.

Markdown

[Zhu et al. "AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models." NeurIPS 2023 Workshops: SoLaR, 2023.](https://mlanthology.org/neuripsw/2023/zhu2023neuripsw-autodan/)

BibTeX

@inproceedings{zhu2023neuripsw-autodan,
  title     = {{AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models}},
  author    = {Zhu, Sicheng and Zhang, Ruiyi and An, Bang and Wu, Gang and Barrow, Joe and Wang, Zichao and Huang, Furong and Nenkova, Ani and Sun, Tong},
  booktitle = {NeurIPS 2023 Workshops: SoLaR},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/zhu2023neuripsw-autodan/}
}