MsSVT: Mixed-Scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds

Shaocong Dong, Lihe Ding, Haiyang Wang, Tingfa Xu, Xinli Xu, Jie Wang, Ziyang Bian, Ying Wang, Jianan Li

NeurIPS 2022

Abstract

3D object detection from LiDAR point clouds is fundamental to autonomous driving. Large-scale outdoor scenes usually feature significant variance in instance scales, and thus require features rich in both long-range and fine-grained information to support accurate detection. Recent detectors leverage window-based transformers to model long-range dependencies but tend to blur out fine-grained details. To bridge this gap, we present a novel Mixed-scale Sparse Voxel Transformer, named MsSVT, which captures both types of information simultaneously through a divide-and-conquer approach. Specifically, MsSVT explicitly divides attention heads into multiple groups, each in charge of attending to information within a particular range. The outputs of all groups are merged to obtain the final mixed-scale features. Moreover, we provide a novel chessboard sampling strategy to reduce the computational complexity of applying a window-based transformer in 3D voxel space. To improve efficiency, we also implement the voxel sampling and gathering operations sparsely with a hash map. Endowed with a powerful capability and high efficiency in modeling mixed-scale information, our single-stage detector built on top of MsSVT surprisingly outperforms state-of-the-art two-stage detectors on Waymo. Our project page: https://github.com/dscdyc/MsSVT.
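To make the sparse gathering idea concrete, here is a minimal pure-Python sketch, not the authors' implementation. It assumes a hash map from integer voxel coordinates to feature-row indices, one cubic key window per head group, and a simple parity rule standing in for chessboard sampling. The function names, the window radii, and the parity rule are all illustrative assumptions; the abstract does not specify them.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int, int]


def build_voxel_hash(coords: List[Coord]) -> Dict[Coord, int]:
    """Map each non-empty voxel coordinate to its row index in a feature table."""
    return {coord: index for index, coord in enumerate(coords)}


def gather_window(table: Dict[Coord, int], center: Coord, radius: int) -> List[int]:
    """Collect indices of non-empty voxels inside a cubic window around `center`.

    Empty voxels are simply absent from the hash map, so they cost one failed
    lookup each and never need to be stored densely.
    """
    cx, cy, cz = center
    found = []
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            for dz in range(-radius, radius + 1):
                index = table.get((cx + dx, cy + dy, cz + dz))
                if index is not None:
                    found.append(index)
    return found


def mixed_scale_keys(table: Dict[Coord, int], center: Coord,
                     radii: List[int]) -> List[List[int]]:
    """Return one key set per head group, each with its own (assumed) radius."""
    return [gather_window(table, center, radius) for radius in radii]


def chessboard_queries(coords: List[Coord], phase: int) -> List[Coord]:
    """Keep voxels whose (x + y) parity equals `phase`.

    This parity rule is a hypothetical stand-in for the chessboard sampling
    named in the abstract; it halves the number of query voxels per pass.
    """
    return [c for c in coords if (c[0] + c[1]) % 2 == phase]


if __name__ == "__main__":
    voxels = [(0, 0, 0), (1, 0, 0), (3, 0, 0), (0, 2, 0)]
    table = build_voxel_hash(voxels)
    for query in chessboard_queries(voxels, phase=0):
        small, large = mixed_scale_keys(table, query, radii=[1, 3])
        print(query, "small window:", sorted(small), "large window:", sorted(large))
```

In a full model, each head group would run attention over its own key set and the per-group outputs would then be merged, as the abstract describes. The sketch stops at the index gathering step.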

Cite

Text

Dong et al. "MsSVT: Mixed-Scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds." Neural Information Processing Systems, 2022.

Markdown

[Dong et al. "MsSVT: Mixed-Scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/dong2022neurips-mssvt/)

BibTeX

@inproceedings{dong2022neurips-mssvt,
  title     = {{MsSVT: Mixed-Scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds}},
  author    = {Dong, Shaocong and Ding, Lihe and Wang, Haiyang and Xu, Tingfa and Xu, Xinli and Wang, Jie and Bian, Ziyang and Wang, Ying and Li, Jianan},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/dong2022neurips-mssvt/}
}