Inpainting-Guided Policy Optimization for Diffusion Large Language Models

Abstract

Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs, offering competitive performance while supporting unique generation capabilities such as inpainting. We explore how inpainting can inform RL algorithm design for dLLMs. Aligning LLMs with reinforcement learning faces an exploration challenge: reward signals are sparse, and samples are wasted when models fail to discover correct solutions. While this inefficiency affects LLMs broadly, dLLMs offer a distinctive opportunity: their inpainting ability can guide exploration. We introduce IGPO (Inpainting-Guided Policy Optimization), an RL framework that strategically inserts partial ground-truth reasoning traces during online sampling. Unlike providing full solutions, inpainting steers exploration toward promising trajectory spaces while preserving self-generated reasoning, bridging supervised fine-tuning and reinforcement learning. We apply IGPO to group-based optimization methods such as GRPO, where exploration failures cause zero advantages and therefore zero gradients. IGPO restores meaningful gradients while improving sample efficiency. We also propose supervised fine-tuning on synthetically rewritten concise traces that better align with dLLM generation patterns. With additional techniques including entropy-based filtering, our training recipe yields substantial gains across four mathematical benchmarks (GSM8K, Math500, AMC and Minerva), achieving new state-of-the-art results for full-attention masked dLLMs.
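
The zero-advantage failure mode mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly used group-relative normalization (reward minus group mean, divided by group standard deviation); the function name `group_advantages`, the `eps` constant, and the binary rewards are illustrative assumptions, not taken from the paper, and the sketch does not implement IGPO itself.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - group mean) / (group std + eps).

    Assumed normalization for illustration; `eps` guards against
    division by zero when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every sampled response is wrong: all rewards equal the group mean,
# so every advantage is zero and the group contributes no gradient.
all_failed = group_advantages([0.0, 0.0, 0.0, 0.0])

# If a single response in the group is correct (for example, one that was
# steered by a partial hint), the advantages become non-zero again.
with_one_success = group_advantages([0.0, 0.0, 0.0, 1.0])

print(all_failed)
print(with_one_success)
```

Under these assumptions, a group in which no sample succeeds carries no learning signal, which is the situation the abstract says inpainted partial traces are meant to avoid.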

Cite

Text

Zhao et al. "Inpainting-Guided Policy Optimization for Diffusion Large Language Models." International Conference on Learning Representations, 2026.

Markdown

[Zhao et al. "Inpainting-Guided Policy Optimization for Diffusion Large Language Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zhao2026iclr-inpaintingguided/)

BibTeX

@inproceedings{zhao2026iclr-inpaintingguided,
  title     = {{Inpainting-Guided Policy Optimization for Diffusion Large Language Models}},
  author    = {Zhao, Siyan and Liu, Mengchen and Huang, Jing and Liu, Miao and Wang, Chenyu and Liu, Bo and Tian, Yuandong and Pang, Guan and Bell, Sean and Grover, Aditya and Chen, Feiyu},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/zhao2026iclr-inpaintingguided/}
}