Leveraging Context in Jailbreaking Attacks
Abstract
Large Language Models (LLMs) are powerful but vulnerable to jailbreaking attacks that elicit harmful information through query modifications. As LLMs strengthen their defenses, launching such attacks directly becomes increasingly difficult. Our approach, Contextual Interaction Attack, is inspired by how humans rely on indirect context to elicit harmful information and uses similarly indirect methods to bypass these safeguards. It exploits the autoregressive generation process of LLMs, in which prior context plays a critical role: through a series of non-harmful question-answer interactions, we subtly steer the model toward producing harmful information. Evaluated across multiple LLMs, our black-box method proves effective and transferable, highlighting the importance of understanding and manipulating context vectors in LLM security research.
Cite
Text
Cheng et al. "Leveraging Context in Jailbreaking Attacks." ICLR 2024 Workshops: SeT_LLM, 2024.
Markdown
[Cheng et al. "Leveraging Context in Jailbreaking Attacks." ICLR 2024 Workshops: SeT_LLM, 2024.](https://mlanthology.org/iclrw/2024/cheng2024iclrw-leveraging/)
BibTeX
@inproceedings{cheng2024iclrw-leveraging,
  title = {{Leveraging Context in Jailbreaking Attacks}},
  author = {Cheng, Yixin and Georgopoulos, Markos and Cevher, Volkan and Chrysos, Grigorios},
  booktitle = {ICLR 2024 Workshops: SeT_LLM},
  year = {2024},
  url = {https://mlanthology.org/iclrw/2024/cheng2024iclrw-leveraging/}
}