Fast-dLLM V2: Efficient Block-Diffusion LLM

Wu, Chengyue; Zhang, Hao; Xue, Shuchen; Diao, Shizhe; Fu, Yonggan; Liu, Zhijian; Molchanov, Pavlo; Luo, Ping; Han, Song; Xie, Enze

ICLR 2026

Abstract

Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherent sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that efficiently adapts pretrained AR models into dLLMs for parallel text generation—requiring only ∼1B tokens of fine-tuning. This represents a 500× reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens), while preserving the original model’s performance. Our approach introduces a novel training recipe that combines a block diffusion mechanism with a complementary attention mask, enabling blockwise bidirectional context modeling without sacrificing AR training objectives. To further accelerate decoding, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations across blocks, and a sub-block cache that enables efficient parallel generation within partially decoded blocks. Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves up to 2.5× speedup over standard AR decoding without compromising generation quality. Extensive experiments across diverse benchmarks demonstrate that Fast-dLLM v2 matches or surpasses AR baselines in accuracy, while delivering state-of-the-art efficiency among dLLMs—marking a significant step toward the practical deployment of fast and accurate LLMs.
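The blockwise bidirectional context described in the abstract can be illustrated with a small attention-mask sketch. The abstract does not give the exact form of the mask, so this is an assumption: the generic block-causal pattern, in which a position attends to every position in its own block and in all earlier blocks. The function name `block_causal_mask` and its parameters are hypothetical and do not come from the paper or its code release.

```python
from typing import List


def block_causal_mask(seq_len: int, block_size: int) -> List[List[bool]]:
    """Build a block-causal attention mask (illustrative assumption only).

    mask[i][j] is True when query position i may attend to key position j.
    Positions in the same block see each other in both directions;
    positions in earlier blocks are visible, later blocks are hidden.
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")

    mask: List[List[bool]] = []
    for i in range(seq_len):
        query_block = i // block_size
        # A key is visible if its block index does not exceed the query's.
        row = [(j // block_size) <= query_block for j in range(seq_len)]
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Print the mask for 6 positions split into blocks of 2.
    for row in block_causal_mask(seq_len=6, block_size=2):
        print("".join("1" if allowed else "0" for allowed in row))
```

With `block_size=1` this reduces to the ordinary causal mask of an AR model, and with `block_size >= seq_len` it becomes fully bidirectional, which is one way to see block diffusion as sitting between the two regimes.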

Cite

Text

Wu et al. "Fast-dLLM V2: Efficient Block-Diffusion LLM." International Conference on Learning Representations, 2026.

Markdown

[Wu et al. "Fast-dLLM V2: Efficient Block-Diffusion LLM." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wu2026iclr-fastdllm/)

BibTeX

@inproceedings{wu2026iclr-fastdllm,
  title     = {{Fast-dLLM V2: Efficient Block-Diffusion LLM}},
  author    = {Wu, Chengyue and Zhang, Hao and Xue, Shuchen and Diao, Shizhe and Fu, Yonggan and Liu, Zhijian and Molchanov, Pavlo and Luo, Ping and Han, Song and Xie, Enze},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/wu2026iclr-fastdllm/}
}