Promptable 3-D Object Localization with Latent Diffusion Models

Abstract

Accurate identification and localization of objects in 3-D scenes are essential for advancing comprehensive 3-D scene understanding. Although diffusion models have demonstrated impressive capabilities across a broad spectrum of computer vision tasks, their potential in both 2-D and 3-D object detection remains underexplored. Existing approaches typically formulate detection as a "noise-to-box" process, but they rely heavily on direct coordinate regression, which limits adaptability for more advanced tasks such as grounding-based object detection. To overcome this limitation, we propose a promptable 3-D object recognition framework, which introduces a diffusion-based paradigm for flexible and conditionally guided 3-D object detection. Our approach encodes bounding boxes into latent representations and employs latent diffusion models to realize a "promptable noise-to-box" transformation. This formulation enables the refinement of standard 3-D object detection using textual prompts, such as class labels. Moreover, it naturally extends to grounding-based object detection through conditioning on natural language descriptions, and generalizes effectively to few-shot learning by incorporating annotated exemplars as visual prompts. We conduct thorough evaluations on three key 3-D object recognition tasks: general 3-D object detection, few-shot detection, and grounding-based detection. Experimental results demonstrate that our framework achieves competitive performance relative to state-of-the-art methods, validating its effectiveness, versatility, and broad applicability in 3-D computer vision.
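
As a rough illustration of the idea of denoising in a box latent space under a prompt, the following toy sketch may help. It is not the paper's implementation: the hand-crafted `encode_box`/`decode_box` pair stands in for a learned box encoder, `oracle_denoiser` stands in for a trained prompt-conditioned network (it simply looks up a known target latent for the prompt), and the linear noise schedule, the deterministic DDIM-style update, and all names and constants are assumptions made for this example.

```python
import math
import random

SCENE_SCALE = 10.0  # assumed scene extent (metres) used to normalise box centres
LATENT_DIM = 8      # size of the toy latent; not taken from the paper

def encode_box(box):
    """Hypothetical hand-crafted encoder: (cx, cy, cz, w, l, h, yaw) -> latent."""
    cx, cy, cz, w, l, h, yaw = box
    return [cx / SCENE_SCALE, cy / SCENE_SCALE, cz / SCENE_SCALE,
            math.log(w), math.log(l), math.log(h),
            math.sin(yaw), math.cos(yaw)]

def decode_box(z):
    """Inverse of encode_box: latent -> (cx, cy, cz, w, l, h, yaw)."""
    return (z[0] * SCENE_SCALE, z[1] * SCENE_SCALE, z[2] * SCENE_SCALE,
            math.exp(z[3]), math.exp(z[4]), math.exp(z[5]),
            math.atan2(z[6], z[7]))

def alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    out, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out

def add_noise(z0, eps, ab):
    """Forward process: z_t = sqrt(ab) * z_0 + sqrt(1 - ab) * eps."""
    return [math.sqrt(ab) * a + math.sqrt(1.0 - ab) * e for a, e in zip(z0, eps)]

def oracle_denoiser(z_t, ab_t, z_target):
    """Placeholder for a learned prompt-conditioned noise predictor.

    It returns the exact noise that would map z_target to z_t, which a real
    network could only approximate.
    """
    return [(zt - math.sqrt(ab_t) * z0) / math.sqrt(1.0 - ab_t)
            for zt, z0 in zip(z_t, z_target)]

def ddim_step(z_t, eps_hat, ab_t, ab_prev):
    """One deterministic DDIM-style update from noise level ab_t to ab_prev."""
    out = []
    for zt, e in zip(z_t, eps_hat):
        z0_hat = (zt - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
        out.append(math.sqrt(ab_prev) * z0_hat + math.sqrt(1.0 - ab_prev) * e)
    return out

def sample_box(prompt, targets, num_steps=50, seed=0):
    """Start from Gaussian noise and denoise towards the box for `prompt`."""
    rng = random.Random(seed)
    abars = alpha_bars(num_steps)
    z = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    for t in reversed(range(num_steps)):
        ab_prev = abars[t - 1] if t > 0 else 1.0  # final step removes all noise
        eps_hat = oracle_denoiser(z, abars[t], targets[prompt])
        z = ddim_step(z, eps_hat, abars[t], ab_prev)
    return decode_box(z)

if __name__ == "__main__":
    # Hypothetical prompt-to-latent table replacing text or visual conditioning.
    targets = {"chair": encode_box((1.0, 2.0, 0.5, 0.6, 0.6, 0.9, 0.3))}
    print(sample_box("chair", targets))
```

Because the denoiser here is an oracle, the sampled box matches the target for the prompt up to floating-point error; with a trained network the recovered box would depend on how well the noise is predicted from the scene and the prompt.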

Cite

Text

Hong et al. "Promptable 3-D Object Localization with Latent Diffusion Models." Advances in Neural Information Processing Systems, 2025.

Markdown

[Hong et al. "Promptable 3-D Object Localization with Latent Diffusion Models." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/hong2025neurips-promptable/)

BibTeX

@inproceedings{hong2025neurips-promptable,
  title     = {{Promptable 3-D Object Localization with Latent Diffusion Models}},
  author    = {Hong, Cheng-Yao and Wang, Li-Heng and Liu, Tyng-Luh},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/hong2025neurips-promptable/}
}