Jailbreak Defense in LLM via Attention Head Analysis and Selective Intervention

Abstract

Jailbreak attacks reveal a persistent gap between the intended alignment of language models and their actual behavior during inference. To address this, we investigate how such attacks succeed at the internal level of model computation, focusing on attention heads. Unlike previous studies that primarily analyzed why jailbreaks work, our approach aims to develop a defense mechanism. We identify attention heads that influence whether a model produces a harmful or safe response by comparing activation patterns between a harmful prompt that is rejected and its adversarial variant that elicits a harmful response. By interpolating the internal representations of these heads between the two scenarios, we suppress harmful outputs while maintaining appropriate responses to benign prompts. Experiments with representative jailbreak methods, including GCG and AutoDAN, show that our method significantly reduces attack success rates without degrading response quality. For instance, with Llama-2-7b-chat, the average success rate drops from 39.3% to 1.1%. These findings reveal how internal attention dynamics affect output generation and demonstrate that targeted manipulation of internal components can enhance safety without requiring external filters or additional training.
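The two steps named in the abstract, selecting heads by comparing activations across the refused and jailbroken runs and then interpolating those heads between the two scenarios, can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names `select_heads` and `interpolate_heads`, the array layout `(num_layers, num_heads, head_dim)`, the L2-norm scoring rule, the top-`k` selection, and the linear mixing weight `alpha`. The paper's actual selection criterion and interpolation scheme may differ.

```python
import numpy as np


def select_heads(refused, jailbroken, k):
    """Return the k (layer, head) pairs whose activations differ most.

    Hypothetical scoring rule: L2 norm of the per-head activation gap
    between the refused run and the jailbroken run.
    Both inputs have shape (num_layers, num_heads, head_dim).
    """
    gap = np.linalg.norm(jailbroken - refused, axis=-1)  # (num_layers, num_heads)
    # Sort the flattened scores in descending order and keep the top k.
    top = np.argsort(gap, axis=None)[::-1][:k]
    layers, heads = np.unravel_index(top, gap.shape)
    return [(int(l), int(h)) for l, h in zip(layers, heads)]


def interpolate_heads(current, reference, heads, alpha):
    """Linearly mix selected heads of `current` toward `reference`.

    alpha = 0 leaves the activations untouched; alpha = 1 replaces the
    selected heads with the reference (refused-run) activations.
    Heads not listed in `heads` are never modified.
    """
    patched = current.copy()
    for layer, head in heads:
        patched[layer, head] = (
            (1.0 - alpha) * current[layer, head] + alpha * reference[layer, head]
        )
    return patched
```

In a real model these arrays would come from hooks on the attention layers, and the patched activations would be written back before the output projection; that plumbing is model-specific and is omitted here.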

Cite

Text

Arai et al. "Jailbreak Defense in LLM via Attention Head Analysis and Selective Intervention." Proceedings of the 17th Asian Conference on Machine Learning, 2025.

Markdown

[Arai et al. "Jailbreak Defense in LLM via Attention Head Analysis and Selective Intervention." Proceedings of the 17th Asian Conference on Machine Learning, 2025.](https://mlanthology.org/acml/2025/arai2025acml-jailbreak/)

BibTeX

@inproceedings{arai2025acml-jailbreak,
  title     = {{Jailbreak Defense in LLM via Attention Head Analysis and Selective Intervention}},
  author    = {Arai, Masaki and Shibahara, Toshiki and Chiba, Daiki and Akiyama, Mitsuaki and Uchida, Masato},
  booktitle = {Proceedings of the 17th Asian Conference on Machine Learning},
  year      = {2025},
  pages     = {351--366},
  volume    = {304},
  url       = {https://mlanthology.org/acml/2025/arai2025acml-jailbreak/}
}