Multi-Turn Jailbreaking Large Language Models via Attention Shifting
Abstract
Large Language Models (LLMs) have achieved strong performance on various natural language processing tasks but also pose safety and ethical threats, thus requiring red teaming and alignment processes to bolster their safety. To effectively exploit these aligned LLMs, recent studies have introduced jailbreak attacks based on multi-turn dialogues. These attacks aim to prompt LLMs to generate harmful or biased content by guiding them through contextual content. However, the underlying reasons for the effectiveness of multi-turn jailbreaks remain unclear. Existing attacks often focus on optimizing queries and escalating toxicity to construct dialogues, lacking a thorough analysis of the inherent vulnerabilities of LLMs. In this paper, we first conduct an in-depth analysis of the differences between single-turn and multi-turn jailbreaks and find that successful multi-turn jailbreaks effectively disperse the attention of LLMs away from keywords associated with harmful behaviors, especially in historical responses. Based on this finding, we propose ASJA, a new multi-turn jailbreak approach that shifts the attention of LLMs by iteratively fabricating the dialogue history with a genetic algorithm to induce LLMs to generate harmful content. Extensive experiments on three LLMs and two datasets show that our approach surpasses existing approaches in jailbreak effectiveness, the stealth of jailbreak prompts, and attack efficiency. Our work emphasizes the importance of hardening LLMs' attention mechanism in multi-turn dialogue scenarios as a direction for better defense strategies.
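The abstract describes ASJA's core loop as a genetic algorithm that iteratively fabricates dialogue histories. Below is a minimal, hypothetical sketch of such an evolutionary loop, not the authors' implementation: the `fitness` function is a random stub standing in for ASJA's actual objective (which, per the abstract, would involve querying the target LLM and measuring how its attention disperses over harmful-behavior keywords), and all names (`Turn`, `History`, `rewrite_pool`, etc.) are illustrative assumptions.

```python
import random

Turn = tuple[str, str]   # (user query, fabricated assistant response)
History = list[Turn]

def fitness(history: History) -> float:
    """Placeholder score. In ASJA, fitness would come from querying the
    target LLM and measuring how far its attention is dispersed away from
    harmful-behavior keywords in the fabricated history; a random stub
    stands in here so the sketch runs standalone."""
    return random.random()

def crossover(a: History, b: History) -> History:
    """Splice two dialogue histories at a random turn boundary."""
    if min(len(a), len(b)) < 2:
        return list(a)
    cut = random.randint(1, min(len(a), len(b)) - 1)
    return a[:cut] + b[cut:]

def mutate(history: History, rewrite_pool: list[Turn], rate: float = 0.3) -> History:
    """With probability `rate`, replace a turn with a fabricated rewrite."""
    return [random.choice(rewrite_pool) if random.random() < rate else turn
            for turn in history]

def evolve(population: list[History], rewrite_pool: list[Turn],
           generations: int = 20, n_elite: int = 4) -> History:
    """Evolve fabricated dialogue histories: select the fittest, then
    recombine and mutate them to form the next generation."""
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[:n_elite]                       # keep best histories
        children = [mutate(crossover(*random.sample(elite, 2)), rewrite_pool)
                    for _ in range(len(population) - n_elite)]
        population = elite + children
    return max(population, key=fitness)
```

In a real attack setting, the initial population and rewrite pool would presumably be seeded from dialogue turns related to the target harmful behavior; those details are not specified in this abstract.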
Cite
Text
Du et al. "Multi-Turn Jailbreaking Large Language Models via Attention Shifting." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I22.34553

Markdown
[Du et al. "Multi-Turn Jailbreaking Large Language Models via Attention Shifting." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/du2025aaai-multi/) doi:10.1609/AAAI.V39I22.34553

BibTeX
@inproceedings{du2025aaai-multi,
  title     = {{Multi-Turn Jailbreaking Large Language Models via Attention Shifting}},
  author    = {Du, Xiaohu and Mo, Fan and Wen, Ming and Gu, Tu and Zheng, Huadi and Jin, Hai and Shi, Jie},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {23814-23822},
  doi       = {10.1609/AAAI.V39I22.34553},
  url       = {https://mlanthology.org/aaai/2025/du2025aaai-multi/}
}