Visual Programming for Zero-Shot Open-Vocabulary 3D Visual Grounding
Abstract
3D Visual Grounding (3DVG) aims at localizing 3D objects based on textual descriptions. Conventional supervised methods for 3DVG often necessitate extensive annotations and a predefined vocabulary, which can be restrictive. To address this issue, we propose a novel visual programming approach for zero-shot open-vocabulary 3DVG, leveraging the capabilities of large language models (LLMs). Our approach begins with a unique dialog-based method, engaging with LLMs to establish a foundational understanding of zero-shot 3DVG. Building on this, we design a visual program that consists of three types of modules, i.e., view-independent, view-dependent, and functional modules. Furthermore, we develop an innovative language-object correlation module to extend the scope of existing 3D object detectors into open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot approach can outperform some supervised baselines, marking a significant stride towards effective 3DVG. Code is available at https://curryyuan.github.io/ZSVG3D.
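To make the composition idea concrete, below is a minimal sketch of what such a visual program might look like for a query such as "the chair closest to the door". The module names (locate as a view-independent module, closest as a functional module) and the Box3D structure are hypothetical illustrations, not the authors' actual API; the point is that an LLM emits a short program that chains these modules over detected 3D boxes.

# Hypothetical module names; a sketch of the idea, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class Box3D:
    label: str
    center: tuple  # (x, y, z) in world coordinates

def locate(scene: list[Box3D], category: str) -> list[Box3D]:
    """View-independent module: retrieve all boxes matching a category."""
    return [b for b in scene if b.label == category]

def closest(targets: list[Box3D], anchors: list[Box3D]) -> Box3D:
    """Functional module: pick the target nearest to any anchor object."""
    def sq_dist(a: Box3D, b: Box3D) -> float:
        return sum((p - q) ** 2 for p, q in zip(a.center, b.center))
    return min(targets, key=lambda t: min(sq_dist(t, a) for a in anchors))

# Program an LLM might emit for "the chair closest to the door":
scene = [Box3D("chair", (0, 0, 0)), Box3D("chair", (4, 1, 0)), Box3D("door", (5, 1, 0))]
result = closest(locate(scene, "chair"), locate(scene, "door"))
print(result)  # -> the chair at (4, 1, 0)

In this framing, open-vocabulary grounding comes from the category lookup (the paper's language-object correlation module handles categories outside the detector's training vocabulary), while spatial reasoning is delegated to the composed modules rather than to a supervised grounding network.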
Cite
Text
Yuan et al. "Visual Programming for Zero-Shot Open-Vocabulary 3D Visual Grounding." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01949
Markdown
[Yuan et al. "Visual Programming for Zero-Shot Open-Vocabulary 3D Visual Grounding." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/yuan2024cvpr-visual/) doi:10.1109/CVPR52733.2024.01949
BibTeX
@inproceedings{yuan2024cvpr-visual,
title = {{Visual Programming for Zero-Shot Open-Vocabulary 3D Visual Grounding}},
author = {Yuan, Zhihao and Ren, Jinke and Feng, Chun-Mei and Zhao, Hengshuang and Cui, Shuguang and Li, Zhen},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {20623-20633},
doi = {10.1109/CVPR52733.2024.01949},
url = {https://mlanthology.org/cvpr/2024/yuan2024cvpr-visual/}
}