MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with Informative-Preserved Reconstruction and Self-Distilled Consistency
Abstract
Masked Modeling (MM) has demonstrated widespread success in various vision challenges by reconstructing masked visual patches. Yet applying MM to large-scale 3D scenes remains an open problem due to data sparsity and scene complexity. The conventional random masking paradigm used in 2D images often carries a high risk of ambiguity when recovering the masked regions of 3D scenes. To this end, we propose a novel informative-preserved reconstruction, which explores local statistics to discover and preserve the representative structured points, effectively enhancing the pretext masking task for 3D scene understanding. Combined with progressive reconstruction, our method can concentrate on modeling regional geometry and suffers less ambiguity in masked reconstruction. In addition, scenes with progressive masking ratios can also serve to self-distill their intrinsic spatial consistency, which requires learning consistent representations from unmasked areas. Combining informative-preserved reconstruction on masked areas with consistency self-distillation from unmasked areas yields a unified framework called MM-3DScene. We conduct comprehensive experiments on a host of downstream tasks. The consistent improvements (e.g., +6.1% mAP@0.5 on object detection and +2.2% mIoU on semantic segmentation) demonstrate the superiority of our approach.
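The masking idea in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not say which local statistic is used, so everything concrete here is an assumption made for illustration: the score (distance from a point to the centroid of its k nearest neighbours), the `preserve_ratio` and `k` parameters, and the function names `local_difference_score` and `informative_preserved_mask` are hypothetical and not taken from the paper.

```python
import numpy as np


def local_difference_score(points: np.ndarray, k: int = 8) -> np.ndarray:
    """Hypothetical informativeness score (an assumption, not the paper's):
    distance from each point to the centroid of its k nearest neighbours."""
    # Brute-force pairwise squared distances; fine for a small illustration.
    diff = points[:, None, :] - points[None, :, :]
    dist2 = np.einsum("ijk,ijk->ij", diff, diff)
    # Exclude each point from its own neighbourhood.
    np.fill_diagonal(dist2, np.inf)
    knn = np.argsort(dist2, axis=1)[:, :k]
    centroids = points[knn].mean(axis=1)
    return np.linalg.norm(points - centroids, axis=1)


def informative_preserved_mask(
    points: np.ndarray,
    mask_ratio: float,
    preserve_ratio: float = 0.1,
    k: int = 8,
    seed: int = 0,
) -> np.ndarray:
    """Return a boolean array that is True where a point is masked.

    The highest-scoring points are never masked; the masked set is drawn
    uniformly at random from the remaining points.
    """
    n = len(points)
    n_preserve = int(round(preserve_ratio * n))
    # Never try to mask more points than remain after preservation.
    n_mask = min(int(round(mask_ratio * n)), n - n_preserve)

    scores = local_difference_score(points, k)
    order = np.argsort(-scores)        # most informative first
    candidates = order[n_preserve:]    # everything except preserved points

    rng = np.random.default_rng(seed)
    masked_idx = rng.choice(candidates, size=n_mask, replace=False)

    mask = np.zeros(n, dtype=bool)
    mask[masked_idx] = True
    return mask
```

Calling `informative_preserved_mask` with several `mask_ratio` values is one way to obtain views of a scene at different masking levels; how the paper schedules those ratios is not stated in the abstract.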
Cite
Text
Xu et al. "MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with Informative-Preserved Reconstruction and Self-Distilled Consistency." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.00426
Markdown
[Xu et al. "MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with Informative-Preserved Reconstruction and Self-Distilled Consistency." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/xu2023cvpr-mm3dscene/) doi:10.1109/CVPR52729.2023.00426
BibTeX
@inproceedings{xu2023cvpr-mm3dscene,
title = {{MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with Informative-Preserved Reconstruction and Self-Distilled Consistency}},
author = {Xu, Mingye and Xu, Mutian and He, Tong and Ouyang, Wanli and Wang, Yali and Han, Xiaoguang and Qiao, Yu},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2023},
pages = {4380-4390},
doi = {10.1109/CVPR52729.2023.00426},
url = {https://mlanthology.org/cvpr/2023/xu2023cvpr-mm3dscene/}
}