Cross-Modal 3D Representation with Multi-View Images and Point Clouds

Abstract

The advancement of 3D understanding and representation is a crucial step for the next phase of autonomous driving, robotics, augmented and virtual reality, 3D gaming, and 3D e-commerce. However, existing 3D semantic representation research has primarily focused on point clouds to perceive 3D objects and scenes, overlooking the rich visual details offered by multi-view images and thereby limiting the potential of 3D semantic representation. This paper introduces OpenView, a novel representation method that integrates both point clouds and multi-view images to form a unified 3D representation. OpenView comprises a unique fusion framework, sequence-independent modeling, a cross-modal fusion encoder, and a progressive hard learning strategy. Our experiments demonstrate that OpenView outperforms the state of the art by 11.5% on the R@1 metric for cross-modal retrieval and by 5.5% on the Top-1 metric for zero-shot classification. Furthermore, we showcase several applications of OpenView: 3D retrieval, 3D captioning, and hierarchical data clustering, highlighting its generality in the field of 3D representation learning.
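To make the architecture named in the abstract more concrete, below is a minimal, hypothetical sketch of a cross-modal fusion encoder that combines point-cloud tokens with per-view image tokens. This page does not include the authors' code: every module name, dimension, and design choice here (learned modality-type embeddings, a [CLS] readout token, and no positional encoding across views, in the spirit of the abstract's "sequence-independent modeling") is an assumption for illustration, not OpenView's actual implementation.

# Hypothetical sketch (NOT the authors' code). Assumes point-cloud and
# multi-view image features are already extracted as token sequences of a
# shared dimension; fusion is a transformer over the concatenated tokens.
import torch
import torch.nn as nn

class CrossModalFusionEncoder(nn.Module):
    def __init__(self, dim=512, heads=8, layers=4):
        super().__init__()
        # Learned type embeddings distinguish the two modalities. No
        # positional encoding is added across views, so the encoder's
        # output does not depend on the order the views arrive in.
        self.pc_type = nn.Parameter(torch.zeros(1, 1, dim))
        self.img_type = nn.Parameter(torch.zeros(1, 1, dim))
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        enc_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.proj = nn.Linear(dim, dim)

    def forward(self, pc_tokens, img_tokens):
        # pc_tokens:  (B, Np, dim) point-cloud feature tokens
        # img_tokens: (B, Nv, dim) one pooled feature per rendered view
        b = pc_tokens.size(0)
        tokens = torch.cat([
            self.cls.expand(b, -1, -1),
            pc_tokens + self.pc_type,
            img_tokens + self.img_type,
        ], dim=1)
        fused = self.encoder(tokens)
        # The [CLS] token is read out as the unified 3D representation.
        return self.proj(fused[:, 0])

# Usage with random stand-in features:
model = CrossModalFusionEncoder()
pc = torch.randn(2, 256, 512)    # 256 point-group tokens per object
views = torch.randn(2, 6, 512)   # 6 rendered views per object
emb = model(pc, views)           # (2, 512) joint embedding

Because self-attention is permutation-equivariant and no view-order positions are injected, permuting the view tokens leaves the [CLS] embedding unchanged, which is one plausible way to realize order-independent multi-view fusion.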

Cite

Text

Zhou et al. "Cross-Modal 3D Representation with Multi-View Images and Point Clouds." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00353

Markdown

[Zhou et al. "Cross-Modal 3D Representation with Multi-View Images and Point Clouds." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/zhou2025cvpr-crossmodal/) doi:10.1109/CVPR52734.2025.00353

BibTeX

@inproceedings{zhou2025cvpr-crossmodal,
  title     = {{Cross-Modal 3D Representation with Multi-View Images and Point Clouds}},
  author    = {Zhou, Ziyang and Wang, Pinghui and Liang, Zi and Bai, Haitao and Zhang, Ruofei},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {3728--3739},
  doi       = {10.1109/CVPR52734.2025.00353},
  url       = {https://mlanthology.org/cvpr/2025/zhou2025cvpr-crossmodal/}
}