Diffusion LLMs Can Do Faster-than-AR Inference via Discrete Diffusion Forcing

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, no existing open-source dLLM has achieved faster inference than AR LLMs of similar size. This paper breaks this barrier with a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation, which enables KV cache utilization; and (2) prediction of following tokens without requiring completion of prior blocks, which enables inter-block parallel decoding. In this way, vanilla dLLMs are recast into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs to achieve rapid convergence. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than $\mathbf{2.5\times}$ the inference speed of LLaMA3 and Qwen2.5 on GSM8K. Compared with vanilla dLLMs such as LLaDA and Dream, the acceleration can exceed $\mathbf{50\times}$ while maintaining comparable output quality.
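
The abstract does not spell out how block-wise autoregressive generation is realized. One common way to obtain that behavior, assumed here purely for illustration, is a block-causal attention mask: every token attends to all tokens in its own block and in earlier blocks, and to none in later blocks. Because earlier blocks never see later ones, their keys and values can in principle be cached once the block is finalized. The function name `block_causal_mask` and the parameters `seq_len` and `block_size` are hypothetical and not taken from the paper.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask[q][k], True when query position q may attend to key position k.

    Illustrative assumption: attention is bidirectional inside a block and
    causal across blocks. This is a sketch, not the authors' implementation.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        # A key is visible when its block index does not exceed the query's.
        [k // block_size <= q // block_size for k in range(seq_len)]
        for q in range(seq_len)
    ]


mask = block_causal_mask(seq_len=6, block_size=2)
for row in mask:
    # 'x' marks an allowed query-key pair, '.' a masked one.
    print("".join("x" if allowed else "." for allowed in row))
```

With six positions and a block size of two, the first two rows allow only the first block, the next two rows allow the first two blocks, and the last two rows allow all positions.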

Cite

Text

Wang et al. "Diffusion LLMs Can Do Faster-than-AR Inference via Discrete Diffusion Forcing." International Conference on Learning Representations, 2026.

Markdown

[Wang et al. "Diffusion LLMs Can Do Faster-than-AR Inference via Discrete Diffusion Forcing." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wang2026iclr-diffusion/)

BibTeX

@inproceedings{wang2026iclr-diffusion,
  title     = {{Diffusion LLMs Can Do Faster-than-AR Inference via Discrete Diffusion Forcing}},
  author    = {Wang, Xu and Xu, Chenkai and Jin, Yijie and Jin, Jiachun and Zhang, Hao and Yu, Kai and Deng, Zhijie},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/wang2026iclr-diffusion/}
}