PixNerd: Pixel Neural Field Diffusion
Abstract
The current success of diffusion transformers is built on the compressed latent space shaped by a pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To avoid these problems, researchers have returned to pixel-space modeling, but at the cost of complicated cascade pipelines and increased token complexity. Motivated by the simple yet effective diffusion transformer architectures in latent space, we propose to model pixel-space diffusion with a large-patch diffusion transformer and to decode these large patches with neural fields, yielding a single-stage, streamlined, end-to-end solution, which we call the pixel neural field diffusion transformer (**PixNerd**). Thanks to the efficient neural field representation in PixNerd, we achieve **1.93 FID** on ImageNet 256×256 and nearly **8× lower latency** without any complex cascade pipeline or VAE. We also extend the PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieves a competitive overall score of 0.73 on the GenEval benchmark and an overall score of 80.9 on the DPG benchmark.
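The large-patch idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a patch size of 16 (inferred from the model name PixNerd-XXL/16), a two-layer MLP as the neural field, and a single hypothetical linear map `w_hyper` that turns each transformer token into that MLP's weights. The function names `num_tokens`, `patch_coords`, and `decode_patches` are introduced here for illustration only.

```python
import numpy as np


def num_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches (transformer tokens) per image."""
    assert image_size % patch_size == 0, "image size must be divisible by patch size"
    return (image_size // patch_size) ** 2


def patch_coords(patch_size: int) -> np.ndarray:
    """Normalized (y, x) coordinates in [-1, 1] for each pixel of a patch, shape (P*P, 2)."""
    axis = np.linspace(-1.0, 1.0, patch_size)
    yy, xx = np.meshgrid(axis, axis, indexing="ij")
    return np.stack([yy.ravel(), xx.ravel()], axis=-1)


def decode_patches(tokens, w_hyper, patch_size, hidden=8, channels=3):
    """Decode each token into a pixel patch with a per-patch MLP neural field.

    tokens:  (N, D) transformer outputs, one row per patch.
    w_hyper: (D, 2*hidden + hidden*channels) hypothetical linear map that
             predicts the weights of a 2-layer MLP for each patch (assumption).
    Returns an array of shape (N, patch_size, patch_size, channels).
    """
    n = tokens.shape[0]
    params = tokens @ w_hyper                           # (N, 2*hidden + hidden*channels)
    w1 = params[:, : 2 * hidden].reshape(n, 2, hidden)  # coordinate -> hidden weights
    w2 = params[:, 2 * hidden:].reshape(n, hidden, channels)  # hidden -> channel weights
    coords = patch_coords(patch_size)                   # (P*P, 2), shared by all patches
    h = np.tanh(np.einsum("pc,nch->nph", coords, w1))   # (N, P*P, hidden)
    out = np.einsum("nph,nhk->npk", h, w2)              # (N, P*P, channels)
    return out.reshape(n, patch_size, patch_size, channels)


if __name__ == "__main__":
    # Token counts for a 256x256 image: large patches vs. one token per pixel.
    print(num_tokens(256, 16), num_tokens(256, 1))

    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(num_tokens(256, 16), 32))
    w_hyper = rng.normal(size=(32, 2 * 8 + 8 * 3))
    print(decode_patches(tokens, w_hyper, patch_size=16).shape)
```

Under these assumptions, a 256×256 image yields 256 tokens at patch size 16, compared with 65,536 tokens if every pixel were its own token. Each token then carries enough information to parameterize a small coordinate-based network that fills in its patch.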
Cite
Text
Wang et al. "PixNerd: Pixel Neural Field Diffusion." International Conference on Learning Representations, 2026.
Markdown
[Wang et al. "PixNerd: Pixel Neural Field Diffusion." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wang2026iclr-pixnerd/)
BibTeX
@inproceedings{wang2026iclr-pixnerd,
  title     = {{PixNerd: Pixel Neural Field Diffusion}},
  author    = {Wang, Shuai and Gao, Ziteng and Zhu, Chenhui and Huang, Weilin and Wang, Limin},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/wang2026iclr-pixnerd/}
}