Open-Vocabulary Point-Cloud Object Detection Without 3D Annotation

Abstract

The goal of open-vocabulary detection is to identify novel objects based on arbitrary textual descriptions. In this paper, we address open-vocabulary 3D point-cloud detection with a divide-and-conquer strategy, which involves: 1) developing a point-cloud detector that learns a general representation for localizing various objects, and 2) connecting textual and point-cloud representations so that the detector can classify novel object categories through text prompting. Specifically, we leverage image pre-trained models: the point-cloud detector learns to localize objects under the supervision of 2D bounding boxes predicted by pre-trained 2D detectors. Moreover, we propose a novel de-biased triplet cross-modal contrastive learning method to connect the image, point-cloud, and text modalities, thereby enabling the point-cloud detector to benefit from vision-language pre-trained models such as CLIP. This novel use of image and vision-language pre-trained models for point-cloud detectors enables open-vocabulary 3D object detection without any 3D annotations. Experiments demonstrate that the proposed method improves by at least 3.03 and 7.47 points over a wide range of baselines on the ScanNet and SUN RGB-D datasets, respectively. Furthermore, we provide a comprehensive analysis to explain why our approach works.
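
The abstract does not spell out the loss, so the following is only a rough sketch of how three modalities might be tied together contrastively and then used for text-prompted classification. It uses a plain symmetric InfoNCE loss and omits the de-biasing step entirely, since its form is not given here. All function names, the temperature value, and the assumption that row i of each feature matrix describes the same object are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss; row i of `a` is assumed to pair with row i of `b`."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row, then pick the matching pair.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


def triplet_contrastive_loss(point_feat, image_feat, text_feat, temperature=0.07):
    """Pull point-cloud features toward paired image and text features.

    Assumption: a simple sum of two pairwise terms, with no de-biasing.
    """
    return (info_nce(point_feat, image_feat, temperature)
            + info_nce(point_feat, text_feat, temperature))


def classify_by_text(point_feat, class_text_feat):
    """Assign each object the class whose text embedding is most similar."""
    similarity = l2_normalize(point_feat) @ l2_normalize(class_text_feat).T
    return similarity.argmax(axis=1)
```

In this sketch, a novel category is handled at inference time by adding one more row to `class_text_feat`, which is the sense in which classification becomes open-vocabulary once the point-cloud features share a space with the text embeddings.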

Cite

Text

Lu et al. "Open-Vocabulary Point-Cloud Object Detection Without 3D Annotation." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.00121

Markdown

[Lu et al. "Open-Vocabulary Point-Cloud Object Detection Without 3D Annotation." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/lu2023cvpr-openvocabulary/) doi:10.1109/CVPR52729.2023.00121

BibTeX

@inproceedings{lu2023cvpr-openvocabulary,
  title     = {{Open-Vocabulary Point-Cloud Object Detection Without 3D Annotation}},
  author    = {Lu, Yuheng and Xu, Chenfeng and Wei, Xiaobao and Xie, Xiaodong and Tomizuka, Masayoshi and Keutzer, Kurt and Zhang, Shanghang},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {1190-1199},
  doi       = {10.1109/CVPR52729.2023.00121},
  url       = {https://mlanthology.org/cvpr/2023/lu2023cvpr-openvocabulary/}
}