Three Pillars Improving Vision Foundation Model Distillation for LiDAR

Abstract

Self-supervised image backbones can be used to address complex 2D tasks (e.g., semantic segmentation, object discovery) very efficiently and with little or no downstream supervision. Ideally, 3D backbones for lidar should be able to inherit these properties after distillation of these powerful 2D features. The most recent methods for image-to-lidar distillation on autonomous driving data show promising results, obtained thanks to distillation methods that keep improving. Yet, we still notice a large performance gap when measuring, by linear probing, the quality of distilled vs. fully supervised features. In this work, instead of focusing only on the distillation method, we study the effect of three pillars for distillation: the 3D backbone, the pretrained 2D backbone, and the pretraining 2D+3D dataset. In particular, thanks to our scalable distillation method named ScaLR, we show that scaling the 2D and 3D backbones and pretraining on diverse datasets leads to a substantial improvement of the feature quality. This allows us to significantly reduce the gap between the quality of distilled and fully supervised 3D features, and to improve the robustness of the pretrained backbones to domain gaps and perturbations.
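As an illustration of what image-to-lidar feature distillation can look like in general, here is a minimal NumPy sketch. It is not the ScaLR implementation: the abstract does not specify the loss, so the squared distance between L2-normalized features used here is an assumption, and the names `distillation_loss`, `point_feats`, `image_feats`, and `pixel_rc` are hypothetical. It also assumes each lidar point has already been projected to an integer pixel location in the image feature map.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit Euclidean norm."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def distillation_loss(point_feats, image_feats, pixel_rc):
    """Hypothetical point-to-pixel distillation loss (assumed form).

    point_feats: (N, D) features predicted by the 3D backbone, one per point.
    image_feats: (H, W, D) feature map from the frozen 2D backbone.
    pixel_rc:    (N, 2) integer (row, col) of the pixel each point projects to.
    """
    # Gather the 2D teacher feature matched to every lidar point.
    targets = image_feats[pixel_rc[:, 0], pixel_rc[:, 1]]
    p = l2_normalize(point_feats)
    t = l2_normalize(targets)
    # Mean squared distance between unit vectors: about 0 when the
    # directions agree, about 4 when they are opposite.
    return float(np.mean(np.sum((p - t) ** 2, axis=1)))


# Toy usage with random data (shapes only, no real sensor data).
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(8, 16, 32))   # H=8, W=16, D=32
pixel_rc = np.stack(
    [rng.integers(0, 8, size=100), rng.integers(0, 16, size=100)], axis=1
)
point_feats = rng.normal(size=(100, 32))     # untrained "student" output
print(distillation_loss(point_feats, image_feats, pixel_rc))
```

In an actual training setup, the 3D backbone would be optimized to minimize such a loss while the 2D backbone stays frozen; linear probing then trains only a linear classifier on top of the frozen distilled 3D features to measure their quality.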

Cite

Text

Puy et al. "Three Pillars Improving Vision Foundation Model Distillation for LiDAR." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02033

Markdown

[Puy et al. "Three Pillars Improving Vision Foundation Model Distillation for LiDAR." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/puy2024cvpr-three/) doi:10.1109/CVPR52733.2024.02033

BibTeX

@inproceedings{puy2024cvpr-three,
  title     = {{Three Pillars Improving Vision Foundation Model Distillation for LiDAR}},
  author    = {Puy, Gilles and Gidaris, Spyros and Boulch, Alexandre and Siméoni, Oriane and Sautier, Corentin and Pérez, Patrick and Bursuc, Andrei and Marlet, Renaud},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {21519-21529},
  doi       = {10.1109/CVPR52733.2024.02033},
  url       = {https://mlanthology.org/cvpr/2024/puy2024cvpr-three/}
}