Gumbel Distillation for Parallel Text Generation

Zhang, Chi; Hu, Xixi; Liu, Bo; Liu, Qiang

ICLR 2026

Abstract

The slow, sequential nature of autoregressive (AR) language models has driven the adoption of parallel decoding methods. However, these non-autoregressive models often sacrifice generation quality because they struggle to model the complex joint distribution of token sequences. To narrow this performance gap, we introduce Gumbel Distillation, a novel distillation technique that enables parallel decoders to learn this distribution effectively. Our method leverages the Gumbel-Max trick to create a deterministic mapping from a latent Gumbel noise space to the output tokens of a high-performing AR teacher. As a model-agnostic technique, Gumbel Distillation integrates with diverse parallel decoding architectures, including MDLM and BD3-LM. Experiments on LM1B and OpenWebText show that Gumbel Distillation substantially improves the generation quality of parallel language models, achieving a 30.0% improvement in MAUVE score and a 10.5% improvement in generative perplexity over MDLM trained on the OpenWebText dataset.
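To make the Gumbel-Max trick mentioned in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It relies on the standard fact that `argmax(logits + g)`, with `g` drawn from a standard Gumbel distribution, is an exact sample from `softmax(logits)`. Once the noise is fixed, the sampled token sequence is a deterministic function of that noise, which is the kind of noise-to-token mapping the abstract describes. The names `toy_teacher_logits` and `decode_with_noise` are hypothetical, and the toy teacher is an assumption standing in for a real AR language model.

```python
import numpy as np


def sample_gumbel(rng, shape):
    """Draw standard Gumbel(0, 1) noise via the inverse CDF: -log(-log(U))."""
    u = rng.uniform(low=1e-12, high=1.0, size=shape)
    return -np.log(-np.log(u))


def gumbel_max(logits, gumbel_noise):
    """Deterministic map: return argmax(logits + noise) over the vocabulary axis."""
    return np.argmax(logits + gumbel_noise, axis=-1)


def toy_teacher_logits(prefix, vocab_size):
    """Hypothetical stand-in for an AR teacher: logits depend on the last token."""
    last = prefix[-1] if prefix else 0
    return np.cos(np.arange(vocab_size) * (last + 1.0))


def decode_with_noise(noise, vocab_size):
    """Decode sequentially with the toy teacher, using one noise row per position.

    Given the same noise array, this always returns the same token sequence.
    """
    tokens = []
    for step_noise in noise:
        logits = toy_teacher_logits(tokens, vocab_size)
        tokens.append(int(gumbel_max(logits, step_noise)))
    return tokens


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = sample_gumbel(rng, (8, 5))  # 8 positions, vocabulary of 5 tokens
    print(decode_with_noise(noise, vocab_size=5))
```

In this sketch, each (noise, token sequence) pair produced by `decode_with_noise` could serve as a training example for a model that sees all the noise at once and predicts every token in parallel. How the paper actually conditions the student on the noise is not specified in the abstract.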

Cite

Text

Zhang et al. "Gumbel Distillation for Parallel Text Generation." International Conference on Learning Representations, 2026.

Markdown

[Zhang et al. "Gumbel Distillation for Parallel Text Generation." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zhang2026iclr-gumbel/)

BibTeX

@inproceedings{zhang2026iclr-gumbel,
  title     = {{Gumbel Distillation for Parallel Text Generation}},
  author    = {Zhang, Chi and Hu, Xixi and Liu, Bo and Liu, Qiang},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/zhang2026iclr-gumbel/}
}