GraphShield: Graph-Theoretic Modeling of Network-Level Dynamics for Robust Jailbreak Detection

Abstract

Large language models (LLMs) are increasingly deployed in real-world applications but remain highly vulnerable to jailbreak prompts that bypass safety guardrails and elicit harmful outputs. We propose GraphShield, a graph-theoretic jailbreak detector that models information routing inside the LLM as token–layer graphs. Unlike prior defenses that rely on surface cues or costly gradient signals, GraphShield captures network-level dynamics in a lightweight and model-agnostic way by extracting multi-scale structural and semantic features that reveal jailbreak signatures. Extensive experiments on LLaMA-2-7B-Chat and Vicuna-7B-v1.5 show that GraphShield reduces attack success rates to 1.9% and 7.8%, respectively, while keeping refusal rates on benign prompts at 7.1% and 6.8%, significantly improving the robustness–utility trade-off compared to strong baselines. These results demonstrate that graph-theoretic modeling of network-level dynamics provides a principled and effective framework for robust jailbreak detection in LLMs.
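The abstract does not specify how the token-layer graphs are built or which features are extracted, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes that nodes are (layer, token) pairs, that edges come from thresholded attention-like weights between adjacent layers, and that a few simple graph statistics stand in for "structural features". The names `build_token_layer_graph`, `structural_features`, the `threshold` parameter, and the toy weights are all hypothetical.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]  # (layer index, token position)
Edge = Tuple[Node, Node]


def build_token_layer_graph(
    attn: List[List[List[float]]], threshold: float = 0.2
) -> Dict[Edge, float]:
    """Build a directed graph from per-layer routing weights.

    Assumption: attn[l][i][j] is the weight that token i in layer l+1
    places on token j in layer l. Weights below `threshold` are dropped.
    """
    edges: Dict[Edge, float] = {}
    for l, layer in enumerate(attn):
        for i, row in enumerate(layer):
            for j, w in enumerate(row):
                if w >= threshold:
                    edges[((l, j), (l + 1, i))] = w
    return edges


def structural_features(
    edges: Dict[Edge, float], num_layers: int, num_tokens: int
) -> Dict[str, float]:
    """Compute a few hypothetical graph-level statistics."""
    possible = num_layers * num_tokens * num_tokens
    out_degree: Dict[Node, int] = {}
    for src, _dst in edges:
        out_degree[src] = out_degree.get(src, 0) + 1
    return {
        "density": len(edges) / possible if possible else 0.0,
        "max_out_degree": float(max(out_degree.values(), default=0)),
        "mean_edge_weight": sum(edges.values()) / len(edges) if edges else 0.0,
    }


# Toy routing weights: 2 layer transitions over 3 tokens (made-up numbers).
attn = [
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.1, 0.3, 0.6]],
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.1, 0.1]],
]
edges = build_token_layer_graph(attn, threshold=0.2)
features = structural_features(edges, num_layers=len(attn), num_tokens=3)
print(features)
```

In a detector along these lines, a feature vector like `features` would be computed per prompt and passed to a separate classifier; the choice of features and classifier here is left open because the abstract does not describe them.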

Cite

Text

Dong et al. "GraphShield: Graph-Theoretic Modeling of Network-Level Dynamics for Robust Jailbreak Detection." International Conference on Learning Representations, 2026.

Markdown

[Dong et al. "GraphShield: Graph-Theoretic Modeling of Network-Level Dynamics for Robust Jailbreak Detection." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/dong2026iclr-graphshield/)

BibTeX

@inproceedings{dong2026iclr-graphshield,
  title     = {{GraphShield: Graph-Theoretic Modeling of Network-Level Dynamics for Robust Jailbreak Detection}},
  author    = {Dong, Sunghee and Yi, Sungwon and Bae, Kangmin and Kim, Jaeyoon and Kim, Seongyeop},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/dong2026iclr-graphshield/}
}