Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds

Abstract

Transformers have demonstrated promising performance in many 2D vision tasks. However, applying the self-attention underlying the transformer to large-scale point cloud data is cumbersome, because a point cloud is a long sequence that is unevenly distributed in 3D space. To address this issue, existing methods usually either compute self-attention locally by grouping the points into clusters of the same size, or perform convolutional self-attention on a discretized representation. However, the former results in stochastic point dropout, while the latter typically has a narrow attention field. In this paper, we propose a novel voxel-based architecture, namely Voxel Set Transformer (VoxSeT), to detect 3D objects from point clouds by means of set-to-set translation. VoxSeT is built upon a voxel-based set attention (VSA) module, which reduces the self-attention in each voxel to two cross-attentions and models features in a hidden space induced by a group of latent codes. With the VSA module, VoxSeT can handle voxelized point clusters of arbitrary size over a wide range, and process them in parallel with linear complexity. The proposed VoxSeT integrates the high performance of the transformer with the efficiency of voxel-based models, and can serve as a good alternative to convolutional and point-based backbones. VoxSeT reports competitive results on the KITTI and Waymo detection benchmarks. The source code of VoxSeT will be released.
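
To make the idea of replacing per-voxel self-attention with two cross-attentions more concrete, the following is a minimal pure-Python sketch. It is not the authors' implementation: the function name `voxel_set_attention`, the helper functions, and the use of plain scaled dot-product softmax attention are assumptions made for illustration, and the sketch omits the learned projections, feed-forward layers, and positional encodings that a real module would need. It loops over voxels sequentially for readability, whereas the abstract describes processing them in parallel.

```python
import math
from collections import defaultdict


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Return sum_i weights[i] * vectors[i] as a new vector."""
    dim = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]


def voxel_set_attention(points, voxel_ids, latents):
    """Hypothetical sketch of set attention through k latent codes.

    points:    N feature vectors (lists of floats, length d)
    voxel_ids: N hashable voxel keys, one per point
    latents:   k latent codes (length d), shared by all voxels
    Returns (per-point outputs, per-voxel hidden features).
    """
    scale = 1.0 / math.sqrt(len(latents[0]))

    # Group point indices by voxel; voxels may hold any number of points.
    groups = defaultdict(list)
    for i, v in enumerate(voxel_ids):
        groups[v].append(i)

    out = [None] * len(points)
    hidden = {}
    for v, members in groups.items():
        feats = [points[i] for i in members]
        # Cross-attention 1: each latent code attends to the voxel's points,
        # compressing the set into k hidden vectors.
        h = [
            weighted_sum(softmax([scale * dot(c, f) for f in feats]), feats)
            for c in latents
        ]
        hidden[v] = h
        # Cross-attention 2: each point attends to the k hidden vectors.
        for i in members:
            weights = softmax([scale * dot(points[i], hj) for hj in h])
            out[i] = weighted_sum(weights, h)
    return out, hidden
```

In this sketch every point interacts only with the k latent-induced hidden vectors of its own voxel, so the total work grows as O(N·k·d) for N points instead of quadratically in the number of points per voxel, and no points need to be dropped or padded to equalize cluster sizes.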

Cite

Text

He et al. "Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.00823

Markdown

[He et al. "Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/he2022cvpr-voxel/) doi:10.1109/CVPR52688.2022.00823

BibTeX

@inproceedings{he2022cvpr-voxel,
  title     = {{Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds}},
  author    = {He, Chenhang and Li, Ruihuang and Li, Shuai and Zhang, Lei},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {8417-8427},
  doi       = {10.1109/CVPR52688.2022.00823},
  url       = {https://mlanthology.org/cvpr/2022/he2022cvpr-voxel/}
}