Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

Jiao, Pengkun; Zhao, Na; Chen, Jingjing; Jiang, Yu-Gang

doi:10.1007/978-3-031-73195-2_22

Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

Pengkun Jiao, Na Zhao, Jingjing Chen, Yu-Gang Jiang

ECCV 2024

doi:10.1007/978-3-031-73195-2_22 /eccv/2024/jiao2024eccv-unlocking/

Abstract

Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene. While language and vision foundation models have achieved success in handling various open-vocabulary tasks with abundant training data, OV-3DDet faces a significant challenge due to the limited availability of training data. Although some pioneering efforts have integrated vision-language models (VLM) knowledge into OV-3DDet learning, the full potential of these foundational models has yet to be fully exploited. In this paper, we unlock the textual and visual wisdom to tackle the open-vocabulary 3D detection task by leveraging the language and vision foundation models. We leverage a vision foundation model to provide image-wise guidance for discovering novel classes in 3D scenes. Specifically, we utilize a object detection vision foundation model to enable the zero-shot discovery of objects in images, which serves as the initial seeds and filtering guidance to identify novel 3D objects. Additionally, to align the 3D space with the powerful vision-language space, we introduce a hierarchical alignment approach, where the 3D feature space is aligned with the vision-language feature space using a pretrained VLM at the instance, category, and scene levels. Through extensive experimentation, we demonstrate significant improvements in accuracy and generalization, highlighting the potential of foundation models in advancing open-vocabulary 3D object detection in real-world scenarios.

PDF ECCV Semantic Scholar

Cite

Text

Jiao et al. "Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73195-2_22

Markdown

[Jiao et al. "Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/jiao2024eccv-unlocking/) doi:10.1007/978-3-031-73195-2_22

BibTeX

@inproceedings{jiao2024eccv-unlocking,
  title     = {{Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image}},
  author    = {Jiao, Pengkun and Zhao, Na and Chen, Jingjing and Jiang, Yu-Gang},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73195-2_22},
  url       = {https://mlanthology.org/eccv/2024/jiao2024eccv-unlocking/}
}