PV-Cap: 3D Dynamic Scene Understanding Through Open Physics-Based Vocabulary

Abstract

Recently, large Vision Language (VL) models, e.g., CLIP, have demonstrated impressive capabilities when trained solely on internet-scale image-language pairs. However, almost all VL models have tackled indoor objects under controlled illumination and camera views, whereas outdoor 3D environments are time-varying, uncontrolled scenes subject to natural phenomena. Captions for such unseen scenes and objects are therefore hard to obtain with a state-of-the-art (SOTA) one-shot algorithm, and the resulting captions are insufficient. This paper proposes PV-Cap (Physics-based Vocabulary for Caption) to enhance 3D scene understanding through enriched captions. Since many tasks in understanding 3D dynamic scenes are hard to handle, PV-Cap aims to disentangle these complexities step by step through multiple grouped Deep Learning and Vision Language models. The proposed i-VQA (iterative VQA) and 3D-CPP (3D Contrastive Physical-Scale Pretraining), the latter extended from SOTA 2D-CLIP, also contribute to generating physical and 3D-based captions. Using many images with 3D dynamic events, e.g., road scenes with traffic flow and accidents, experiments demonstrate the usability and effectiveness of the proposed PV-Cap over SOTA models in terms of segmentation and captions.

Cite

Text

Sakaino et al. "PV-Cap: 3D Dynamic Scene Understanding Through Open Physics-Based Vocabulary." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00791

Markdown

[Sakaino et al. "PV-Cap: 3D Dynamic Scene Understanding Through Open Physics-Based Vocabulary." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/sakaino2024cvprw-pvcap/) doi:10.1109/CVPRW63382.2024.00791

BibTeX

@inproceedings{sakaino2024cvprw-pvcap,
  title     = {{PV-Cap: 3D Dynamic Scene Understanding Through Open Physics-Based Vocabulary}},
  author    = {Sakaino, Hidetomo and Phuong, Thao Nguyen and Duy, Vinh Nguyen},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2024},
  pages     = {7932--7942},
  doi       = {10.1109/CVPRW63382.2024.00791},
  url       = {https://mlanthology.org/cvprw/2024/sakaino2024cvprw-pvcap/}
}