Uni3DL: A Unified Model for 3D Vision-Language Understanding

Abstract

We present Uni3DL, a unified model for 3D vision-language understanding. Unlike existing unified 3D vision-language models, which mostly rely on projected multi-view images and support only a limited set of tasks, Uni3DL operates directly on point clouds and significantly broadens the spectrum of supported tasks in the 3D domain, encompassing both vision and vision-language tasks. At the core of Uni3DL, a query transformer learns task-agnostic semantic and mask outputs by attending to 3D visual features, and a task router selectively produces the task-specific outputs required by diverse tasks. With a unified architecture, Uni3DL enjoys seamless task decomposition and substantial parameter sharing across tasks. Uni3DL has been rigorously evaluated across diverse 3D vision-language understanding tasks, including semantic segmentation, object detection, instance segmentation, visual grounding, 3D captioning, and text-3D cross-modal retrieval, demonstrating performance on par with or surpassing state-of-the-art (SOTA) task-specific models. We hope our benchmark and the Uni3DL model will serve as a solid foundation for future research on unified models for 3D vision-language understanding. Project page: https://uni3dl.github.io/.
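
To make the architecture described in the abstract concrete, below is a minimal PyTorch sketch of the query-transformer-plus-task-router pattern: learnable queries cross-attend to per-point 3D features to produce task-agnostic class and mask outputs, and a router selects which of those shared outputs each task consumes. All names (QueryTransformerSketch, task_router), dimensions, layer counts, and the task list are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class QueryTransformerSketch(nn.Module):
    """Hypothetical sketch: learnable queries attend to 3D point features
    and yield task-agnostic semantic (class) and mask outputs."""

    def __init__(self, num_queries=100, d_model=256, num_classes=20, num_layers=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Shared (task-agnostic) output heads.
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"
        self.mask_embed = nn.Linear(d_model, d_model)

    def forward(self, point_feats):
        # point_feats: (B, N, d_model) backbone features computed on the point cloud
        B = point_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        q = self.decoder(q, point_feats)               # queries attend to 3D features
        logits = self.class_head(q)                    # (B, Q, C+1) semantic outputs
        # Per-query mask logits over points via query/point feature dot products.
        masks = torch.einsum('bqd,bnd->bqn', self.mask_embed(q), point_feats)
        return q, logits, masks

def task_router(task, queries, logits, masks):
    """Hypothetical router: picks which shared outputs a given task consumes."""
    if task in ('semantic_segmentation', 'instance_segmentation', 'detection'):
        return logits, masks    # mask/class outputs drive segmentation-style tasks
    if task in ('captioning', 'retrieval', 'grounding'):
        return queries          # text-side modules consume the query embeddings
    raise ValueError(f'unknown task: {task}')

# Toy usage: 2 point clouds, 1024 points each, 256-dim backbone features.
model = QueryTransformerSketch()
feats = torch.randn(2, 1024, 256)
q, logits, masks = model(feats)
sem_logits, sem_masks = task_router('semantic_segmentation', q, logits, masks)

The point of the routing step, under these assumptions, is that every task shares the same backbone, queries, and decoder; only the lightweight selection of outputs (and any task-specific head downstream) differs, which is what enables the parameter sharing the abstract highlights.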

Cite

Text

Li et al. "Uni3DL: A Unified Model for 3D Vision-Language Understanding." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73337-6_5

Markdown

[Li et al. "Uni3DL: A Unified Model for 3D Vision-Language Understanding." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/li2024eccv-uni3dl/) doi:10.1007/978-3-031-73337-6_5

BibTeX

@inproceedings{li2024eccv-uni3dl,
  title     = {{Uni3DL: A Unified Model for 3D Vision-Language Understanding}},
  author    = {Li, Xiang and Ding, Jian and Chen, Zhaoyang and Elhoseiny, Mohamed},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73337-6_5},
  url       = {https://mlanthology.org/eccv/2024/li2024eccv-uni3dl/}
}