More than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models

Hongkai Lin, Dingkang Liang, Mingyang Du, Xin Zhou, Xiang Bai

NeurIPS 2025

Abstract

Generative depth estimation methods leverage the rich visual priors stored in pretrained text-to-image diffusion models and demonstrate astonishing zero-shot capability. However, updating the parameters during training catastrophically degrades the image generation capability of the pretrained model. We introduce MERGE, a unified model for image generation and depth estimation that starts from a pretrained text-to-image model with fixed parameters. MERGE demonstrates that a pretrained text-to-image model can do more than generate images: it can also be extended to depth estimation effortlessly. Specifically, MERGE introduces a plug-and-play framework that enables seamless switching between image generation and depth estimation modes through simple, pluggable converters. We also propose a Group Reuse Mechanism that encourages parameter reuse and improves the utilization of the additional learnable parameters. MERGE unleashes the powerful depth estimation capability of the pretrained text-to-image model while preserving its original image generation ability. Compared to other unified models for image generation and depth estimation, MERGE achieves state-of-the-art performance across multiple depth estimation benchmarks. The code and model will be made available.

Cite

Text

Lin et al. "More than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models." Advances in Neural Information Processing Systems, 2025.

Markdown

[Lin et al. "More than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/lin2025neurips-more/)

BibTeX

@inproceedings{lin2025neurips-more,
  title     = {{More than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models}},
  author    = {Lin, Hongkai and Liang, Dingkang and Du, Mingyang and Zhou, Xin and Bai, Xiang},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/lin2025neurips-more/}
}