Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training
Abstract
3D visual language reasoning plays an important role in effective human-computer interaction. Current approaches to 3D visual reasoning are task-specific and lack pre-training methods to learn generic representations that can transfer across tasks. Despite the encouraging progress in vision-language pre-training for image-text data, 3D-language pre-training remains an open problem because of limited 3D-language paired data, the highly sparse and irregular structure of point clouds, and ambiguities in the spatial relations of 3D objects under viewpoint changes. In this paper, we present a generic 3D-language pre-training approach that tackles multiple facets of 3D-language reasoning by learning universal representations. Our learning objective consists of two main parts: 1) Context-aware spatial-semantic alignment, which establishes fine-grained correspondence between point clouds and texts and reduces relational ambiguities by aligning 3D spatial relationships with textual semantic context. 2) Mutual 3D-language masked modeling, which enables cross-modality information exchange. Instead of reconstructing sparse 3D points, for which language can hardly provide cues, we propose masked proposal reasoning to learn semantic class and mask-invariant representations. Our proposed 3D-language pre-training method achieves promising results once adapted to various downstream tasks, including 3D visual grounding, 3D dense captioning, and 3D question answering. Our code is available at https://github.com/leolyj/3D-VLP
Cite
Text
Jin et al. "Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01057
Markdown
[Jin et al. "Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/jin2023cvpr-contextaware/) doi:10.1109/CVPR52729.2023.01057
BibTeX
@inproceedings{jin2023cvpr-contextaware,
title = {{Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training}},
author = {Jin, Zhao and Hayat, Munawar and Yang, Yuwei and Guo, Yulan and Lei, Yinjie},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2023},
pages = {10984-10994},
doi = {10.1109/CVPR52729.2023.01057},
url = {https://mlanthology.org/cvpr/2023/jin2023cvpr-contextaware/}
}