FM-OV3D: Foundation Model-Based Cross-Modal Knowledge Blending for Open-Vocabulary 3D Detection

Zhang, Dongmei; Li, Chang; Zhang, Renrui; Xie, Shenghao; Xue, Wei; Xie, Xiaodong; Zhang, Shanghang

doi:10.1609/AAAI.V38I15.29612

AAAI 2024 pp. 16723-16731

Abstract

The strong performance of pre-trained foundation models on a range of visual tasks underscores their potential to enhance the open-vocabulary ability of 2D models. Existing methods explore analogous applications in 3D space. However, most of them center on extracting knowledge from a single foundation model, which limits the open-vocabulary ability of 3D models. We hypothesize that leveraging complementary pre-trained knowledge from multiple foundation models can improve knowledge transfer from 2D pre-trained visual language models to 3D space. In this work, we propose FM-OV3D, a method of Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection, which improves the open-vocabulary localization and recognition abilities of 3D models by blending knowledge from multiple pre-trained foundation models, achieving true open-vocabulary detection without being constrained by the original 3D datasets. Specifically, to learn open-vocabulary 3D localization, we adopt the open-vocabulary localization knowledge of the Grounded-Segment-Anything model. For open-vocabulary 3D recognition, we leverage the knowledge of generative foundation models, including GPT-3 and Stable Diffusion, and of cross-modal discriminative models such as CLIP. Experimental results on two popular benchmarks for open-vocabulary 3D object detection show that our model efficiently learns knowledge from multiple foundation models to enhance the open-vocabulary ability of the 3D model and achieves state-of-the-art performance on open-vocabulary 3D object detection. Code is released at https://github.com/dmzhang0425/FM-OV3D.git.
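As a rough illustration of the recognition idea mentioned in the abstract, the sketch below shows how a CLIP-style cross-modal model can label a detected region with an arbitrary vocabulary: the region's feature is compared by cosine similarity against one text embedding per class name. This is not the authors' implementation. The function names, the three-dimensional toy vectors, and the class names are all assumptions made for illustration; in practice the embeddings would come from a pre-trained encoder.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine_scores(region_feature, text_embeddings):
    """Return the cosine similarity between one region feature and each class embedding."""
    region = l2_normalize(region_feature)
    return {
        label: sum(a * b for a, b in zip(region, l2_normalize(embedding)))
        for label, embedding in text_embeddings.items()
    }


def classify_open_vocabulary(region_feature, text_embeddings):
    """Pick the class whose text embedding is most similar to the region feature."""
    scores = cosine_scores(region_feature, text_embeddings)
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy stand-ins for encoder outputs (hypothetical values, not real CLIP embeddings).
text_embeddings = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "sofa": [0.0, 0.0, 1.0],
}
region_feature = [0.9, 0.1, 0.2]

label, scores = classify_open_vocabulary(region_feature, text_embeddings)
print(label)  # chair
```

Because the vocabulary is just a dictionary of text embeddings, a new category can be added at inference time by inserting another entry, without retraining the classifier in this toy setup.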

Cite

Text

Zhang et al. "FM-OV3D: Foundation Model-Based Cross-Modal Knowledge Blending for Open-Vocabulary 3D Detection." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I15.29612

Markdown

[Zhang et al. "FM-OV3D: Foundation Model-Based Cross-Modal Knowledge Blending for Open-Vocabulary 3D Detection." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/zhang2024aaai-fm/) doi:10.1609/AAAI.V38I15.29612

BibTeX

@inproceedings{zhang2024aaai-fm,
  title     = {{FM-OV3D: Foundation Model-Based Cross-Modal Knowledge Blending for Open-Vocabulary 3D Detection}},
  author    = {Zhang, Dongmei and Li, Chang and Zhang, Renrui and Xie, Shenghao and Xue, Wei and Xie, Xiaodong and Zhang, Shanghang},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {16723--16731},
  doi       = {10.1609/AAAI.V38I15.29612},
  url       = {https://mlanthology.org/aaai/2024/zhang2024aaai-fm/}
}