A Self-Explaining Neural Architecture for Generalizable Concept Learning

Abstract

Most jailbreak methods for large language models (LLMs) focus on superficially improving attack success through manually defined rules. However, they fail to uncover the underlying mechanisms within target LLMs that explain why an attack succeeds or fails. In this paper, we propose investigating the phenomenon of jailbreaks and defenses for LLMs from the perspective of attention distributions within the models. A preliminary experiment reveals that the success of a jailbreak is closely linked to the LLM's attention on sensitive words. Inspired by this finding, we propose incorporating critical signals derived from internal attention distributions within LLMs, namely Attention Intensity on Sensitive Words and Attention Dispersion Entropy, to guide both attacks and defenses. Drawing inspiration from the concept of "Feint and Attack", we introduce an attention-guided jailbreak model, ABA, which redirects the model's attention to benign contexts, and an attention-based defense model, ABD, designed to detect attacks by analyzing internal attention entropy. Experimental results demonstrate that our proposals outperform state-of-the-art baselines.

Cite

Text

Sinha et al. "A Self-Explaining Neural Architecture for Generalizable Concept Learning." International Joint Conference on Artificial Intelligence, 2024. doi:10.24963/ijcai.2024/56

Markdown

[Sinha et al. "A Self-Explaining Neural Architecture for Generalizable Concept Learning." International Joint Conference on Artificial Intelligence, 2024.](https://mlanthology.org/ijcai/2024/sinha2024ijcai-self/) doi:10.24963/ijcai.2024/56

BibTeX

@inproceedings{sinha2024ijcai-self,
  title     = {{A Self-Explaining Neural Architecture for Generalizable Concept Learning}},
  author    = {Sinha, Sanchit and Xiong, Guangzhi and Zhang, Aidong},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {503--511},
  doi       = {10.24963/ijcai.2024/56},
  url       = {https://mlanthology.org/ijcai/2024/sinha2024ijcai-self/}
}