Visual Programming for Zero-Shot Open-Vocabulary 3D Visual Grounding

Abstract

3D Visual Grounding (3DVG) aims at localizing a 3D object based on a textual description. Conventional supervised methods for 3DVG often necessitate extensive annotations and a predefined vocabulary, which can be restrictive. To address this issue, we propose a novel visual programming approach for zero-shot open-vocabulary 3DVG, leveraging the capabilities of large language models (LLMs). Our approach begins with a unique dialog-based method, engaging with LLMs to establish a foundational understanding of zero-shot 3DVG. Building on this, we design a visual program that consists of three types of modules, i.e., view-independent, view-dependent, and functional modules. Furthermore, we develop an innovative language-object correlation module to extend the scope of existing 3D object detectors into open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot approach can outperform some supervised baselines, marking a significant stride towards effective 3DVG. Code is available at https://curryyuan.github.io/ZSVG3D.
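To give a rough sense of what "compiling a query into a visual program" could look like, here is a minimal, hypothetical sketch. The module names (`LOC`, `CLOSEST`), the scene representation, and the example query are assumptions for illustration only, not the paper's actual modules or API: a view-independent module retrieves objects by category, and a functional module composes their results.

```python
from dataclasses import dataclass
import math

@dataclass
class Obj:
    label: str
    center: tuple  # (x, y, z) box center in scene coordinates

def LOC(scene, category):
    """Hypothetical view-independent module: all objects of a category."""
    return [o for o in scene if o.label == category]

def CLOSEST(candidates, anchors):
    """Hypothetical functional module: candidate nearest to any anchor."""
    return min(
        candidates,
        key=lambda c: min(math.dist(c.center, a.center) for a in anchors),
    )

# An LLM might compile "the chair closest to the window" into:
scene = [
    Obj("chair", (0.0, 0.0, 0.0)),
    Obj("chair", (5.0, 0.0, 0.0)),
    Obj("window", (4.0, 1.0, 0.0)),
]
target = CLOSEST(LOC(scene, "chair"), LOC(scene, "window"))
```

View-dependent relations (e.g. "left of") would additionally need a camera or viewpoint argument, which this sketch omits.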

Cite

Text

Yuan et al. "Visual Programming for Zero-Shot Open-Vocabulary 3D Visual Grounding." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01949

Markdown

[Yuan et al. "Visual Programming for Zero-Shot Open-Vocabulary 3D Visual Grounding." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/yuan2024cvpr-visual/) doi:10.1109/CVPR52733.2024.01949

BibTeX

@inproceedings{yuan2024cvpr-visual,
  title     = {{Visual Programming for Zero-Shot Open-Vocabulary 3D Visual Grounding}},
  author    = {Yuan, Zhihao and Ren, Jinke and Feng, Chun-Mei and Zhao, Hengshuang and Cui, Shuguang and Li, Zhen},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {20623-20633},
  doi       = {10.1109/CVPR52733.2024.01949},
  url       = {https://mlanthology.org/cvpr/2024/yuan2024cvpr-visual/}
}