Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers

Xu, Gangwei; Lin, Haotong; Luo, Hongcheng; Wang, Xianqi; Yao, Jingfeng; Zhu, Lianghui; Pu, Yuechuan; Chi, Cheng; Sun, Haiyang; Wang, Bing; Chen, Guang; Ye, Hangjun; Peng, Sida; Yang, Xin

NeurIPS 2025

Abstract

This paper presents **Pixel-Perfect Depth**, a monocular depth estimation model based on pixel-space diffusion generation that produces high-quality, flying-pixel-free point clouds from estimated depth maps. Current generative depth estimation models fine-tune Stable Diffusion and achieve impressive performance. However, they require a VAE to compress depth maps into the latent space, which inevitably introduces flying pixels at edges and details. Our model addresses this challenge by directly performing diffusion generation in the pixel space, avoiding VAE-induced artifacts. To overcome the high complexity associated with pixel-space generation, we introduce two novel designs: 1) **Semantics-Prompted Diffusion Transformers** (**SP-DiT**), which incorporate semantic representations from vision foundation models into DiT to prompt the diffusion process, thereby preserving global semantic consistency while enhancing fine-grained visual details; and 2) **Cascade DiT Design** that progressively increases the number of tokens to further enhance efficiency and accuracy. Our model achieves the best performance among all published generative models across five benchmarks, and significantly outperforms all other models in edge-aware point cloud evaluation. Project page: https://pixel-perfect-depth.github.io/.
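To make the term "flying pixels" concrete, here is a minimal sketch of the artifact itself, not of the paper's model. It assumes a 1D depth profile with a sharp edge and uses a simple box blur as a hypothetical stand-in for any lossy step that smooths depth across an edge; the helper `count_flying_pixels`, the depth values, and the kernel size are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def count_flying_pixels(depth, near, far, tol=1e-6):
    """Count samples whose depth lies strictly between the two true surfaces.

    Such samples back-project to 3D points floating in the empty space
    between foreground and background.
    """
    return int(np.sum((depth > near + tol) & (depth < far - tol)))


# A sharp depth edge: foreground at 1 m, background at 5 m (assumed values).
near, far = 1.0, 5.0
sharp = np.array([near] * 8 + [far] * 8)

# Hypothetical stand-in for a lossy step: a 5-tap box blur.
# This is NOT a VAE; it only mimics smoothing across the edge.
kernel = np.ones(5) / 5.0
padded = np.pad(sharp, 2, mode="edge")  # replicate borders to keep length
blurred = np.convolve(padded, kernel, mode="valid")

print("flying pixels (sharp):  ", count_flying_pixels(sharp, near, far))
print("flying pixels (blurred):", count_flying_pixels(blurred, near, far))
```

The sharp profile has no samples between the two surfaces, while the smoothed one has several intermediate depths right at the edge. Those intermediate depths are what appear as stray points when a depth map is lifted to a point cloud.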

Cite

Text

Xu et al. "Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers." Advances in Neural Information Processing Systems, 2025.

Markdown

[Xu et al. "Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/xu2025neurips-pixelperfect/)

BibTeX

@inproceedings{xu2025neurips-pixelperfect,
  title     = {{Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers}},
  author    = {Xu, Gangwei and Lin, Haotong and Luo, Hongcheng and Wang, Xianqi and Yao, Jingfeng and Zhu, Lianghui and Pu, Yuechuan and Chi, Cheng and Sun, Haiyang and Wang, Bing and Chen, Guang and Ye, Hangjun and Peng, Sida and Yang, Xin},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/xu2025neurips-pixelperfect/}
}