Shelf-Supervised Cross-Modal Pre-Training for 3D Object Detection

Khurana, Mehar; Peri, Neehar; Hays, James; Ramanan, Deva

CoRL 2024 pp. 2080-2103

Abstract

State-of-the-art 3D object detectors are often trained on massive labeled datasets. However, annotating 3D bounding boxes remains prohibitively expensive and time-consuming, particularly for LiDAR. Instead, recent works demonstrate that self-supervised pre-training with unlabeled data can improve detection accuracy with limited labels. Contemporary methods adapt best practices for self-supervised learning from the image domain (such as contrastive learning) to point clouds. However, publicly available 3D datasets are considerably smaller and less diverse than those used for image-based self-supervised learning, which limits the effectiveness of these methods. We note, however, that such data is naturally collected in a multimodal fashion, often paired with images. Rather than pre-training with only self-supervised objectives, we argue that it is better to bootstrap point cloud representations using image-based foundation models trained on internet-scale image data. Specifically, we propose a shelf-supervised approach (i.e., supervised with off-the-shelf image foundation models) for generating zero-shot 3D bounding boxes from paired RGB and LiDAR data. Pre-training 3D detectors with such pseudo-labels yields significantly better semi-supervised detection accuracy than prior self-supervised pretext tasks. Importantly, we show that image-based shelf-supervision is helpful for training both LiDAR-only and multi-modal (RGB + LiDAR) detectors. We demonstrate the effectiveness of our approach on nuScenes and the Waymo Open Dataset (WOD), significantly improving over prior work in limited data settings.
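
As a rough illustration of what generating a 3D pseudo-label from paired RGB and LiDAR data could involve, the sketch below lifts a 2D image detection to a 3D box. It is not the authors' pipeline, which the abstract does not specify. It assumes a pinhole camera with hypothetical intrinsics fx, fy, cx, cy, LiDAR points already transformed into the camera frame (z pointing forward), a 2D box standing in for the output of an image foundation model, and an axis-aligned 3D box fit. The function names project and lift_box are illustrative only.

```python
from typing import List, Optional, Tuple

Point3D = Tuple[float, float, float]
Box2D = Tuple[float, float, float, float]  # (u_min, v_min, u_max, v_max) in pixels


def project(point: Point3D, fx: float, fy: float,
            cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Project a camera-frame point to pixel coordinates (pinhole model, assumed)."""
    x, y, z = point
    if z <= 0:  # point is behind the camera, so it has no valid projection
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def lift_box(points: List[Point3D], box2d: Box2D, fx: float, fy: float,
             cx: float, cy: float) -> Optional[Tuple[Point3D, Point3D]]:
    """Fit an axis-aligned 3D box to the LiDAR points that project inside box2d.

    Returns (center, size), or None when no point falls inside the 2D box.
    Simplification: no outlier removal and no orientation estimate.
    """
    u_min, v_min, u_max, v_max = box2d
    inside = []
    for p in points:
        uv = project(p, fx, fy, cx, cy)
        if uv is None:
            continue
        u, v = uv
        if u_min <= u <= u_max and v_min <= v <= v_max:
            inside.append(p)
    if not inside:
        return None
    mins = [min(p[i] for p in inside) for i in range(3)]
    maxs = [max(p[i] for p in inside) for i in range(3)]
    center = tuple((lo + hi) / 2 for lo, hi in zip(mins, maxs))
    size = tuple(hi - lo for lo, hi in zip(mins, maxs))
    return center, size
```

A practical system would likely also need to filter background points that fall inside the 2D region and to estimate box orientation. Boxes produced this way would serve only as noisy pseudo-labels for pre-training, not as ground truth.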

Cite

Text

Khurana et al. "Shelf-Supervised Cross-Modal Pre-Training for 3D Object Detection." Proceedings of The 8th Conference on Robot Learning, 2024.

Markdown

[Khurana et al. "Shelf-Supervised Cross-Modal Pre-Training for 3D Object Detection." Proceedings of The 8th Conference on Robot Learning, 2024.](https://mlanthology.org/corl/2024/khurana2024corl-shelfsupervised/)

BibTeX

@inproceedings{khurana2024corl-shelfsupervised,
  title     = {{Shelf-Supervised Cross-Modal Pre-Training for 3D Object Detection}},
  author    = {Khurana, Mehar and Peri, Neehar and Hays, James and Ramanan, Deva},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  year      = {2024},
  pages     = {2080-2103},
  volume    = {270},
  url       = {https://mlanthology.org/corl/2024/khurana2024corl-shelfsupervised/}
}