DiffuGuard: How Intrinsic Safety Is Lost and Found in Diffusion Large Language Models

Abstract

The rapid advancement of Diffusion Large Language Models (dLLMs) introduces unprecedented vulnerabilities that are fundamentally distinct from those of autoregressive LLMs, stemming from their iterative and parallel generation mechanisms. In this paper, we conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: *intra-step* and *inter-step* dynamics. Experimental results reveal a harmful bias inherent in the standard greedy remasking strategy and identify a critical phenomenon we term Denoising-path Dependence, where the safety of early-stage tokens decisively influences the final output. These findings also indicate that while current decoding strategies constitute a significant vulnerability, dLLMs possess a substantial intrinsic safety potential. To unlock this potential, we propose **DiffuGuard**, a training-free defense framework that addresses vulnerabilities through a dual-stage approach: **Stochastic Annealing Remasking** dynamically introduces controlled randomness to mitigate greedy selection bias, while **Block-level Audit and Repair** exploits internal model representations for autonomous risk detection and guided correction. Comprehensive experiments on four dLLMs demonstrate DiffuGuard's exceptional effectiveness, reducing the Attack Success Rate against six diverse jailbreak methods from **47.9%** to **14.7%** while preserving model utility and efficiency.
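To make the contrast between greedy remasking and a randomized, annealed variant concrete, the following is a minimal sketch only. The abstract does not give the formula used by Stochastic Annealing Remasking, so everything here is an assumption made for illustration: the function name `select_positions`, the linear decay schedule, the parameter `alpha0`, and the choice to blend confidence with uniform noise. It shows one plausible way to add controlled randomness early in denoising and fall back to greedy selection by the final step.

```python
import random


def select_positions(confidences, k, step, total_steps, alpha0=0.5, rng=None):
    """Choose k masked positions to unmask at one denoising step.

    Hypothetical illustration, not the paper's algorithm.
    confidences: dict mapping position -> model confidence in [0, 1].
    The selection score blends confidence with uniform noise. The noise
    weight alpha decays linearly from alpha0 (first step) to 0 (last step),
    so the last step reduces to plain greedy top-k selection.
    """
    rng = rng or random.Random()
    # Assumed linear annealing schedule for the noise weight.
    alpha = alpha0 * (1.0 - step / max(total_steps - 1, 1))
    scores = {
        pos: (1.0 - alpha) * conf + alpha * rng.random()
        for pos, conf in confidences.items()
    }
    # Keep the k best-scoring positions; all others stay masked.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    conf = {0: 0.9, 1: 0.2, 2: 0.7, 3: 0.4}
    rng = random.Random(0)
    for step in range(4):
        print(step, select_positions(conf, k=2, step=step, total_steps=4, rng=rng))
```

Setting `alpha0=0.0` recovers purely greedy selection at every step, which makes the sketch easy to compare against a greedy baseline under the same assumptions.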

Cite

Text

Li et al. "DiffuGuard: How Intrinsic Safety Is Lost and Found in Diffusion Large Language Models." International Conference on Learning Representations, 2026.

Markdown

[Li et al. "DiffuGuard: How Intrinsic Safety Is Lost and Found in Diffusion Large Language Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/li2026iclr-diffuguard/)

BibTeX

@inproceedings{li2026iclr-diffuguard,
  title     = {{DiffuGuard: How Intrinsic Safety Is Lost and Found in Diffusion Large Language Models}},
  author    = {Li, Zherui and Nie, Zheng and Zhou, Zhenhong and Liu, Yue and Zhang, Yitong and Cheng, Yu and Wen, Qingsong and Wang, Kun and Guo, Yufei and Zhang, Jiaheng},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/li2026iclr-diffuguard/}
}