Unifying 3D Vision-Language Understanding via Promptable Queries
Abstract
A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene. However, a considerable gap exists between existing methods and such a unified model, due to the independent application of representation and insufficient exploration of 3D multi-task training. In this paper, we introduce PQ3D, a unified model capable of using Promptable Queries to tackle a wide range of 3D-VL tasks, from low-level instance segmentation to high-level reasoning and planning. This is achieved through three key innovations: (1) unifying various 3D scene representations (i.e., voxels, point clouds, multi-view images) into a shared 3D coordinate space by segment-level grouping, (2) an attention-based query decoder for task-specific information retrieval guided by prompts, and (3) universal output heads for different tasks to support multi-task training. Tested across ten diverse 3D-VL datasets, PQ3D demonstrates impressive performance on these tasks, setting new records on most benchmarks. In particular, PQ3D improves the state-of-the-art on ScanNet200 by 4.9% (AP25), ScanRefer by 5.4% (acc@0.5), Multi3DRefer by 11.7% (F1@0.5), and Scan2Cap by 13.4% (CIDEr@0.5). Moreover, PQ3D supports flexible inference with individual or combined forms of available 3D representations, e.g., solely voxel input.
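As a rough illustration of the segment-level grouping idea mentioned in the abstract, the sketch below averages per-element features that fall into the same scene segment, so that representations with different element counts end up indexed by the same segment ids. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `pool_to_segments`, the use of mean pooling, and the toy feature values are all hypothetical.

```python
from collections import defaultdict


def pool_to_segments(features, segment_ids):
    """Average feature vectors that share a segment id.

    Assumption: mean pooling stands in for whatever grouping the model
    actually uses; the abstract only says "segment-level grouping".
    """
    if len(features) != len(segment_ids):
        raise ValueError("features and segment_ids must have the same length")
    sums = {}
    counts = defaultdict(int)
    for vec, seg in zip(features, segment_ids):
        if seg not in sums:
            sums[seg] = [0.0] * len(vec)
        for i, value in enumerate(vec):
            sums[seg][i] += value
        counts[seg] += 1
    return {seg: [s / counts[seg] for s in total] for seg, total in sums.items()}


# Toy inputs (hypothetical): three point features and two voxel features,
# each element tagged with the id of the segment it belongs to.
point_feats = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
point_segs = [0, 0, 1]
voxel_feats = [[2.0, 2.0], [6.0, 0.0]]
voxel_segs = [0, 1]

point_seg_feats = pool_to_segments(point_feats, point_segs)
voxel_seg_feats = pool_to_segments(voxel_feats, voxel_segs)
```

After pooling, both dictionaries are keyed by the same segment ids, which is one way to read the claim that different representations can be used individually or in combination at inference time.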
Cite
Text
Zhu et al. "Unifying 3D Vision-Language Understanding via Promptable Queries." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72784-9_11
Markdown
[Zhu et al. "Unifying 3D Vision-Language Understanding via Promptable Queries." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/zhu2024eccv-unifying/) doi:10.1007/978-3-031-72784-9_11
BibTeX
@inproceedings{zhu2024eccv-unifying,
title = {{Unifying 3D Vision-Language Understanding via Promptable Queries}},
author = {Zhu, Ziyu and Zhang, Zhuofan and Ma, Xiaojian and Niu, Xuesong and Chen, Yixin and Jia, Baoxiong and Deng, Zhidong and Huang, Siyuan and Li, Qing},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-72784-9_11},
url = {https://mlanthology.org/eccv/2024/zhu2024eccv-unifying/}
}