GraphShield: Graph-Theoretic Modeling of Network-Level Dynamics for Robust Jailbreak Detection
Abstract
Large language models (LLMs) are increasingly deployed in real-world applications but remain highly vulnerable to jailbreak prompts that bypass safety guardrails and elicit harmful outputs. We propose GraphShield, a graph-theoretic jailbreak detector that models information routing inside the LLM as token–layer graphs. Unlike prior defenses that rely on surface cues or costly gradient signals, GraphShield captures network-level dynamics in a lightweight and model-agnostic way by extracting multi-scale structural and semantic features that reveal jailbreak signatures. Extensive experiments on LLaMA-2-7B-Chat and Vicuna-7B-v1.5 show that GraphShield reduces attack success rates to 1.9% and 7.8%, respectively, while keeping refusal rates on benign prompts at 7.1% and 6.8%, significantly improving the robustness–utility trade-off compared to strong baselines. These results demonstrate that graph-theoretic modeling of network-level dynamics provides a principled and effective framework for robust jailbreak detection in LLMs.
Cite
Text
Dong et al. "GraphShield: Graph-Theoretic Modeling of Network-Level Dynamics for Robust Jailbreak Detection." International Conference on Learning Representations, 2026.
Markdown
[Dong et al. "GraphShield: Graph-Theoretic Modeling of Network-Level Dynamics for Robust Jailbreak Detection." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/dong2026iclr-graphshield/)
BibTeX
@inproceedings{dong2026iclr-graphshield,
title = {{GraphShield: Graph-Theoretic Modeling of Network-Level Dynamics for Robust Jailbreak Detection}},
author = {Dong, Sunghee and Yi, Sungwon and Bae, Kangmin and Kim, Jaeyoon and Kim, Seongyeop},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/dong2026iclr-graphshield/}
}