UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding
Abstract
Performing 3D dense captioning and visual grounding requires a shared understanding of the underlying multimodal relationships. However, despite previous attempts to connect these two related tasks with highly task-specific neural modules, it remains understudied how to explicitly model their shared nature and learn them simultaneously. In this work, we propose UniT3D, a simple yet effective, fully unified transformer-based architecture for jointly solving 3D visual grounding and dense captioning. UniT3D learns a strong multimodal representation across the two tasks through a supervised joint pre-training scheme with bidirectional and seq-to-seq objectives. Thanks to its generic architecture design, UniT3D allows expanding the pre-training scope to a wider variety of training sources, such as data synthesized from 2D prior knowledge, to benefit 3D vision-language tasks. Extensive experiments and analysis demonstrate that UniT3D obtains significant gains for both 3D dense captioning and visual grounding.
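To make the joint formulation concrete, the sketch below illustrates one plausible reading of the abstract: a single shared transformer that fuses 3D object proposal features with text tokens, with a grounding head trained by a bidirectional matching objective and a captioning head trained by a seq-to-seq objective. This is a minimal illustrative sketch, not the authors' implementation; all module names, dimensions, and loss weights are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnifiedTransformerSketch(nn.Module):
    """Hypothetical sketch of a unified transformer: one shared backbone,
    two task heads (grounding and captioning). Names/sizes are illustrative."""

    def __init__(self, d_model=256, vocab_size=4000, nhead=8, num_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.grounding_head = nn.Linear(d_model, 1)         # score per object proposal
        self.caption_head = nn.Linear(d_model, vocab_size)  # next-token logits

    def forward(self, object_feats, token_ids):
        # Fuse 3D object proposal features and text token embeddings
        # in a single sequence processed by the shared transformer.
        text = self.word_emb(token_ids)
        fused = self.backbone(torch.cat([object_feats, text], dim=1))
        n_obj = object_feats.size(1)
        obj_out, txt_out = fused[:, :n_obj], fused[:, n_obj:]
        grounding_logits = self.grounding_head(obj_out).squeeze(-1)  # (B, n_obj)
        caption_logits = self.caption_head(txt_out)                  # (B, T, vocab)
        return grounding_logits, caption_logits


def joint_loss(grounding_logits, target_obj, caption_logits, target_tokens):
    # Grounding cast as proposal classification, captioning as next-token
    # prediction; equal weighting is a placeholder assumption.
    l_ground = F.cross_entropy(grounding_logits, target_obj)
    l_caption = F.cross_entropy(caption_logits.flatten(0, 1), target_tokens.flatten())
    return l_ground + l_caption
```

In this reading, both tasks share all backbone parameters, so pre-training on either objective (or on synthesized 2D-derived data) shapes the same multimodal representation used by the other task.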
Cite
Text
Chen et al. "UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01660
Markdown
[Chen et al. "UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/chen2023iccv-unit3d/) doi:10.1109/ICCV51070.2023.01660
BibTeX
@inproceedings{chen2023iccv-unit3d,
title = {{UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding}},
author = {Chen, Zhenyu and Hu, Ronghang and Chen, Xinlei and Nießner, Matthias and Chang, Angel X.},
booktitle = {International Conference on Computer Vision},
year = {2023},
pages = {18109--18119},
doi = {10.1109/ICCV51070.2023.01660},
url = {https://mlanthology.org/iccv/2023/chen2023iccv-unit3d/}
}