Feint and Attack: Jailbreaking and Protecting LLMs via Attention Distribution Modeling
Abstract
Most jailbreak methods for large language models (LLMs) focus on superficially improving attack success through manually defined rules. However, they fail to uncover the underlying mechanisms within target LLMs that explain why an attack succeeds or fails. In this paper, we propose investigating the phenomenon of jailbreaks and defenses for LLMs from the perspective of attention distributions within the models. A preliminary experiment reveals that the success of a jailbreak is closely linked to the LLM's attention on sensitive words. Inspired by this interesting finding, we propose incorporating critical signals derived from internal attention distributions within LLMs, namely Attention Intensity on Sensitive Words and Attention Dispersion Entropy, to guide both attacks and defenses. Drawing inspiration from the concept of "Feint and Attack", we introduce an attention-guided jailbreak model, ABA, which redirects the model's attention to benign contexts, and an attention-based defense model, ABD, designed to detect attacks by analyzing internal attention entropy. Experimental results demonstrate the superiority of our proposal when compared to SOTA baselines.
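The abstract names the two attention signals but does not define them. The sketch below is one plausible reading, not the paper's formulation: it assumes Attention Intensity on Sensitive Words is the share of attention mass falling on sensitive token positions, and that Attention Dispersion Entropy is the Shannon entropy of the normalized attention vector. The function names, the toy weights, and the choice of sensitive index are all hypothetical.

```python
import math


def attention_intensity(weights, sensitive_idx):
    """Share of total attention mass on the sensitive token positions.

    Assumption: the paper's exact definition is not given in the abstract;
    this is a simple normalized sum used for illustration only.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    return sum(weights[i] for i in sensitive_idx) / total


def dispersion_entropy(weights):
    """Shannon entropy (in nats) of the normalized attention vector.

    Assumption: stands in for the paper's Attention Dispersion Entropy.
    Higher values mean attention is spread more evenly across tokens.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy attention vectors over four tokens; index 2 is the hypothetical sensitive word.
focused = [0.05, 0.05, 0.80, 0.10]    # attention concentrated on the sensitive word
dispersed = [0.25, 0.25, 0.25, 0.25]  # attention spread evenly over the context

focused_intensity = attention_intensity(focused, [2])
dispersed_intensity = attention_intensity(dispersed, [2])
focused_entropy = dispersion_entropy(focused)
dispersed_entropy = dispersion_entropy(dispersed)
```

Under these assumed definitions, the evenly spread vector has lower intensity on the sensitive token and higher entropy than the concentrated one, which is the kind of shift the abstract attributes to redirecting attention toward benign context.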
Cite
Text
Pu et al. "Feint and Attack: Jailbreaking and Protecting LLMs via Attention Distribution Modeling." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/56
Markdown
[Pu et al. "Feint and Attack: Jailbreaking and Protecting LLMs via Attention Distribution Modeling." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/pu2025ijcai-feint/) doi:10.24963/IJCAI.2025/56
BibTeX
@inproceedings{pu2025ijcai-feint,
title = {{Feint and Attack: Jailbreaking and Protecting LLMs via Attention Distribution Modeling}},
author = {Pu, Rui and Li, Chaozhuo and Ha, Rui and Chen, Zejian and Zhang, Litian and Liu, Zheng and Qiu, Lirong and Ye, Zaisheng},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2025},
pages = {493-501},
doi = {10.24963/IJCAI.2025/56},
url = {https://mlanthology.org/ijcai/2025/pu2025ijcai-feint/}
}