Efficient Fine-Tuning of Image-Conditional Diffusion Models for Depth and Surface Normal Estimation
Abstract
Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, high computational demands due to multi-step inference limited its use in many scenarios. We show that this inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed. The fixed model performs comparably to the best previously reported configurations while being more than 200× faster. Furthermore, we show that end-to-end fine-tuning with task-specific losses enables deterministic single-step inference, outperforming previous diffusion-based depth and normal estimation models on common zero-shot benchmarks. This fine-tuning scheme also works similarly well when applied directly to Stable Diffusion.
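The flaw referenced in the abstract is a timestep mismatch in few-step DDIM inference: schedules built with the common "leading" timestep spacing do not start at the final training timestep, so the model receives pure noise while being conditioned on a much lower noise level. Below is a minimal sketch of this mismatch using the Hugging Face diffusers DDIMScheduler; the scheduler class and its arguments are standard diffusers API, and the printed values assume a 1000-step training schedule with the default steps_offset=0.

from diffusers import DDIMScheduler

# "Leading" spacing: a one-step schedule starts at t=0, so the model is
# handed pure noise while being told it is almost fully denoised.
leading = DDIMScheduler(num_train_timesteps=1000, timestep_spacing="leading")
leading.set_timesteps(num_inference_steps=1)
print(leading.timesteps)   # tensor([0])

# "Trailing" spacing: the single step lands on the final training timestep,
# so the pure-noise input and the timestep signal agree.
trailing = DDIMScheduler(num_train_timesteps=1000, timestep_spacing="trailing")
trailing.set_timesteps(num_inference_steps=1)
print(trailing.timesteps)  # tensor([999])

With the schedule fixed, a single DDIM step already yields a deterministic prediction; the end-to-end fine-tuning described in the abstract then trains that one step directly with a task-specific loss.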
Cite
Text
Garcia et al. "Efficient Fine-Tuning of Image-Conditional Diffusion Models for Depth and Surface Normal Estimation." NeurIPS 2024 Workshops: AFM, 2024.
Markdown
[Garcia et al. "Efficient Fine-Tuning of Image-Conditional Diffusion Models for Depth and Surface Normal Estimation." NeurIPS 2024 Workshops: AFM, 2024.](https://mlanthology.org/neuripsw/2024/garcia2024neuripsw-efficient/)
BibTeX
@inproceedings{garcia2024neuripsw-efficient,
  title = {{Efficient Fine-Tuning of Image-Conditional Diffusion Models for Depth and Surface Normal Estimation}},
  author = {Garcia, Gonzalo Martin and Zeid, Karim Abou and Schmidt, Christian and de Geus, Daan and Hermans, Alexander and Leibe, Bastian},
  booktitle = {NeurIPS 2024 Workshops: AFM},
  year = {2024},
  url = {https://mlanthology.org/neuripsw/2024/garcia2024neuripsw-efficient/}
}