Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training

Abstract

3D visual-language reasoning plays an important role in effective human-computer interaction. Current approaches to 3D visual reasoning are task-specific and lack pre-training methods that learn generic representations transferable across tasks. Despite encouraging progress in vision-language pre-training for image-text data, 3D-language pre-training remains an open problem because of limited 3D-language paired data, the highly sparse and irregular structure of point clouds, and ambiguities in the spatial relations of 3D objects under viewpoint changes. In this paper, we present a generic 3D-language pre-training approach that tackles multiple facets of 3D-language reasoning by learning universal representations. Our learning objective consists of two main parts. 1) Context-aware spatial-semantic alignment, which establishes fine-grained correspondence between point clouds and texts and reduces relational ambiguities by aligning 3D spatial relationships with textual semantic context. 2) Mutual 3D-language masked modeling, which enables cross-modality information exchange. Instead of reconstructing sparse 3D points, for which language can hardly provide cues, we propose masked proposal reasoning to learn semantic class and mask-invariant representations. Our 3D-language pre-training method achieves promising results when adapted to various downstream tasks, including 3D visual grounding, 3D dense captioning, and 3D question answering. Our code is available at https://github.com/leolyj/3D-VLP
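
To make the idea of masking object proposals (rather than raw points) more concrete, the following is a minimal sketch of the masking step alone. It is not taken from the paper or its repository: the function name `mask_proposals`, the `mask_ratio` parameter, and the all-zero placeholder vector are hypothetical choices for illustration, and the abstract does not specify how masked proposals are represented or which losses are applied to them.

```python
import random


def mask_proposals(features, mask_ratio=0.25, seed=0):
    """Replace a random subset of proposal feature vectors with zeros.

    Hypothetical illustration only: the zero placeholder and the
    mask ratio are assumptions, not details stated in the abstract.
    Returns (masked_features, masked_indices).
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    num_masked = int(round(mask_ratio * len(features)))
    # Sample distinct proposal indices to mask.
    masked_indices = sorted(rng.sample(range(len(features)), num_masked))
    masked_set = set(masked_indices)
    masked = [
        [0.0] * len(vec) if i in masked_set else list(vec)
        for i, vec in enumerate(features)
    ]
    return masked, masked_indices


# Eight toy "proposal" feature vectors of dimension 2.
proposals = [[float(i), float(i) + 0.5] for i in range(8)]
masked, idx = mask_proposals(proposals, mask_ratio=0.25, seed=0)
```

In a pre-training setup of the kind the abstract describes, a model would presumably receive `masked` together with the text and be trained to recover information about the proposals at `idx`, such as their semantic class; that training objective is outside the scope of this sketch.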

Cite

Text

Jin et al. "Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01057

Markdown

[Jin et al. "Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/jin2023cvpr-contextaware/) doi:10.1109/CVPR52729.2023.01057

BibTeX

@inproceedings{jin2023cvpr-contextaware,
  title     = {{Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training}},
  author    = {Jin, Zhao and Hayat, Munawar and Yang, Yuwei and Guo, Yulan and Lei, Yinjie},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {10984--10994},
  doi       = {10.1109/CVPR52729.2023.01057},
  url       = {https://mlanthology.org/cvpr/2023/jin2023cvpr-contextaware/}
}