Orchid: Image Latent Diffusion for Joint Appearance and Geometry Generation

Abstract

We introduce Orchid, a unified latent diffusion model that learns a joint appearance-geometry prior to generate color, depth, and surface normal images in a single diffusion process. This unified approach is more efficient and coherent than current pipelines that use separate models for appearance and geometry. Orchid is versatile: it directly generates color, depth, and normal images from text, supports joint monocular depth and normal estimation via color-conditioned finetuning, and seamlessly inpaints large 3D regions by sampling from the joint distribution. It leverages a novel Variational Autoencoder (VAE) that jointly encodes RGB, relative depth, and surface normals into a shared latent space, combined with a latent diffusion model that denoises these latents. Our extensive experiments demonstrate that Orchid delivers competitive performance against state-of-the-art task-specific methods for geometry prediction, even surpassing them in normal-prediction accuracy and depth-normal consistency. It also inpaints color-depth-normal images jointly, with greater qualitative realism than existing multi-step methods.
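To make the "shared latent space" idea concrete, here is a minimal, purely illustrative sketch of jointly encoding the three modalities: RGB (3 channels), relative depth (1 channel), and surface normals (3 channels) are stacked channel-wise and mapped to a single latent tensor. The shapes, the random linear projection, and all names here are assumptions for illustration only; they are not the paper's actual VAE architecture.

```python
import numpy as np

# Illustrative sketch only: a toy stand-in for a joint RGB-depth-normal
# encoder, NOT the paper's VAE. Shapes and the linear map are hypothetical.
H, W, LATENT_DIM = 8, 8, 4
rng = np.random.default_rng(0)

rgb = rng.random((H, W, 3))      # color image, 3 channels
depth = rng.random((H, W, 1))    # relative depth map, 1 channel
normals = rng.random((H, W, 3))  # surface normal map, 3 channels

# Stack all modalities channel-wise into one joint input: (H, W, 7).
joint = np.concatenate([rgb, depth, normals], axis=-1)

# Toy per-pixel linear projection standing in for the VAE encoder;
# the result is a single shared latent of shape (H, W, LATENT_DIM).
W_enc = rng.standard_normal((7, LATENT_DIM))
latent = joint @ W_enc

print(joint.shape, latent.shape)  # (8, 8, 7) (8, 8, 4)
```

A diffusion model operating on such a joint latent denoises all three modalities in one process, which is why a single sample can yield mutually consistent color, depth, and normals.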

Cite

Text

Krishnan et al. "Orchid: Image Latent Diffusion for Joint Appearance and Geometry Generation." International Conference on Computer Vision, 2025.

Markdown

[Krishnan et al. "Orchid: Image Latent Diffusion for Joint Appearance and Geometry Generation." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/krishnan2025iccv-orchid/)

BibTeX

@inproceedings{krishnan2025iccv-orchid,
  title     = {{Orchid: Image Latent Diffusion for Joint Appearance and Geometry Generation}},
  author    = {Krishnan, Akshay and Yan, Xinchen and Casser, Vincent and Kundu, Abhijit},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {28217--28227},
  url       = {https://mlanthology.org/iccv/2025/krishnan2025iccv-orchid/}
}