Pyramid Patchification Flow for Visual Generation

Abstract

Diffusion Transformers (DiTs) typically use the same patch size for $\operatorname{Patchify}$ at every timestep, which enforces a constant token budget throughout sampling. In this paper, we introduce Pyramidal Patchification Flow (PPFlow), which reduces the number of tokens at high-noise timesteps to improve sampling efficiency. The idea is simple: use larger patches at higher-noise timesteps and smaller patches at lower-noise timesteps. The implementation is easy: share the DiT's transformer blocks across timesteps, and learn separate linear projections for each patch size in $\operatorname{Patchify}$ and $\operatorname{Unpatchify}$. Unlike Pyramidal Flow, which operates on pyramid representations, our approach operates on full latent representations, eliminating trajectory "jump points" and thus avoiding re-noising tricks during sampling. Training from a pretrained SiT-XL/2 requires only $8.9\%$ additional training FLOPs and delivers a $2.02\times$ denoising speedup while maintaining image generation quality; training from scratch achieves a comparable sampling speedup, e.g., $2.04\times$ for SiT-B. Starting from the text-to-image model FLUX.1, PPFlow achieves $1.61\times$ to $1.86\times$ speedups at resolutions from 512 to 2048 with comparable quality.
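To make the token-budget idea concrete, the following is a minimal sketch of how a timestep-dependent patch size changes the number of tokens per denoising step. It is illustrative only, not the paper's implementation: the two-stage schedule, its boundary at $t = 0.5$, the convention that $t = 1$ is pure noise, the $32 \times 32$ latent size, and all function names are assumptions made for this example.

```python
# Hypothetical two-stage schedule: (upper bound on noise level t, patch size).
# Assumed convention: t = 0 is clean data, t = 1 is pure noise.
SCHEDULE = [(0.5, 2), (1.0, 4)]


def patch_size_for(t, schedule=SCHEDULE):
    """Return the patch size used at noise level t (larger patches at higher noise)."""
    for upper_bound, patch in schedule:
        if t <= upper_bound:
            return patch
    raise ValueError(f"t={t} is outside the schedule")


def num_tokens(latent_hw, patch):
    """Number of tokens produced by patchifying an H x W latent with square patches."""
    height, width = latent_hw
    if height % patch or width % patch:
        raise ValueError("latent size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def average_token_fraction(latent_hw, num_steps, schedule=SCHEDULE):
    """Mean tokens per step over uniformly spaced timesteps, relative to
    using the smallest patch size at every step."""
    smallest_patch = min(patch for _, patch in schedule)
    baseline = num_tokens(latent_hw, smallest_patch)
    # Midpoints of num_steps equal-width intervals on [0, 1].
    timesteps = [(i + 0.5) / num_steps for i in range(num_steps)]
    total = sum(num_tokens(latent_hw, patch_size_for(t, schedule)) for t in timesteps)
    return total / (num_steps * baseline)


if __name__ == "__main__":
    latent = (32, 32)  # assumed latent resolution for the example
    print(num_tokens(latent, 2))  # tokens per step with patch size 2
    print(num_tokens(latent, 4))  # tokens per step with patch size 4
    print(average_token_fraction(latent, num_steps=10))
```

Under these assumptions, half of the steps run on a quarter of the tokens. The token fraction is only a rough proxy for cost: attention scales faster than linearly in the token count, so this number should not be read as a prediction of the speedups reported above.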

Cite

Text

Li et al. "Pyramid Patchification Flow for Visual Generation." International Conference on Learning Representations, 2026.

Markdown

[Li et al. "Pyramid Patchification Flow for Visual Generation." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/li2026iclr-pyramid/)

BibTeX

@inproceedings{li2026iclr-pyramid,
  title     = {{Pyramid Patchification Flow for Visual Generation}},
  author    = {Li, Hui and Chen, Baoyou and Jiaye, Li and Wang, Jingdong and Zhu, Siyu},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/li2026iclr-pyramid/}
}