MsSVT: Mixed-Scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds

Abstract

3D object detection from LiDAR point clouds is fundamental to autonomous driving. Large-scale outdoor scenes usually feature significant variance in instance scales, and thus require features rich in both long-range and fine-grained information to support accurate detection. Recent detectors leverage window-based transformers to model long-range dependencies but tend to blur out fine-grained details. To bridge this gap, we present a novel Mixed-scale Sparse Voxel Transformer, named MsSVT, which captures both types of information simultaneously through a divide-and-conquer approach. Specifically, MsSVT explicitly divides attention heads into multiple groups, each in charge of attending to information within a particular range. The outputs of all groups are merged to obtain the final mixed-scale features. Moreover, we provide a novel chessboard sampling strategy to reduce the computational complexity of applying a window-based transformer in 3D voxel space. To improve efficiency, we also implement the voxel sampling and gathering operations sparsely with a hash map. Endowed with the capability to model mixed-scale information efficiently, our single-stage detector built on top of MsSVT surprisingly outperforms state-of-the-art two-stage detectors on Waymo. Our project page: https://github.com/dscdyc/MsSVT.
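As a rough illustration of the ideas named in the abstract, the sketch below shows how a hash map over non-empty voxel coordinates can support sparse gathering, how a chessboard-style pattern can thin out the set of query voxels, and how each head group could be given keys from a window of a different size. It is not the authors' implementation. The function names, the `(x + y)` parity rule used for the chessboard pattern, the cubic window shape, and the radii are all assumptions made for illustration only.

```python
from itertools import product


def build_voxel_hash(coords):
    """Map each non-empty voxel coordinate (x, y, z) to its row index."""
    return {tuple(c): i for i, c in enumerate(coords)}


def gather_window(voxel_hash, center, radius):
    """Return sorted indices of non-empty voxels in a cubic window around center.

    Empty voxels are simply missing from the hash map, so they cost one
    failed lookup and are never stored densely.
    """
    cx, cy, cz = center
    found = []
    for dx, dy, dz in product(range(-radius, radius + 1), repeat=3):
        idx = voxel_hash.get((cx + dx, cy + dy, cz + dz))
        if idx is not None:
            found.append(idx)
    return sorted(found)


def chessboard_queries(coords, phase=0):
    """Keep voxels on one colour of a chessboard (assumed rule: (x + y) parity)."""
    return [i for i, (x, y, _z) in enumerate(coords) if (x + y) % 2 == phase]


def mixed_scale_keys(coords, radii=(1, 2), phase=0):
    """For each sampled query voxel, gather one key set per head group.

    Each entry of `radii` stands for one hypothetical head group: a small
    radius for fine-grained context, a larger one for long-range context.
    """
    voxel_hash = build_voxel_hash(coords)
    return {
        q: [gather_window(voxel_hash, coords[q], r) for r in radii]
        for q in chessboard_queries(coords, phase)
    }


# Toy scene with five non-empty voxels.
coords = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (0, 2, 0), (5, 5, 0)]
keys = mixed_scale_keys(coords)
for query, per_group in keys.items():
    print(query, per_group)
```

In this toy scene, voxel 1 is skipped as a query because it falls on the other chessboard colour, yet it still appears as a key for its neighbours. A real detector would run attention over the gathered features per head group and merge the group outputs; the sketch stops at index gathering.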

Cite

Text

Dong et al. "MsSVT: Mixed-Scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds." Neural Information Processing Systems, 2022.

Markdown

[Dong et al. "MsSVT: Mixed-Scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/dong2022neurips-mssvt/)

BibTeX

@inproceedings{dong2022neurips-mssvt,
  title     = {{MsSVT: Mixed-Scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds}},
  author    = {Dong, Shaocong and Ding, Lihe and Wang, Haiyang and Xu, Tingfa and Xu, Xinli and Wang, Jie and Bian, Ziyang and Wang, Ying and Li, Jianan},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/dong2022neurips-mssvt/}
}