DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets
Abstract
Designing an efficient yet deployment-friendly 3D backbone to handle sparse point clouds is a fundamental problem in 3D perception. Compared with customized sparse convolution, the attention mechanism in Transformers is better suited to flexibly modeling long-range relationships and is easier to deploy in real-world applications. However, due to the sparse characteristics of point clouds, it is non-trivial to apply a standard Transformer to sparse points. In this paper, we present Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D perception. To efficiently process sparse points in parallel, we propose Dynamic Sparse Window Attention, which partitions a series of local regions in each window according to its sparsity and then computes the features of all regions in a fully parallel manner. To allow cross-set connections, we design a rotated set partitioning strategy that alternates between two partitioning configurations in consecutive self-attention layers. To support effective downsampling and better encode geometric information, we also propose an attention-style 3D pooling module on sparse points, which is powerful and deployment-friendly without utilizing any customized CUDA operations. Our model achieves state-of-the-art performance on a broad range of 3D perception tasks. More importantly, DSVT can be easily deployed by TensorRT with real-time inference speed (27Hz). Code will be available at https://github.com/Haiyang-W/DSVT.
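The sparsity-dependent partitioning and the two alternating configurations described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function name `partition_window`, the parameter `max_set_size`, the 2D integer coordinates, and the use of x-major versus y-major sorting as the two configurations are all assumptions made for illustration; the abstract does not specify them. A batched version would additionally need to pad every set to the same length and mask the padding before attention.

```python
import math


def partition_window(voxels, max_set_size, order="x"):
    """Split the non-empty voxels of one window into near-equal sets.

    voxels: list of (x, y) integer coordinates inside the window (assumed 2D).
    order:  "x" sorts x-major, "y" sorts y-major; alternating the two between
            layers is one way to let voxels from different sets interact.
    Returns a list of sets, each a list of indices into `voxels`.
    """
    n = len(voxels)
    if n == 0:
        return []
    # The number of sets depends on how many voxels the window actually holds.
    num_sets = math.ceil(n / max_set_size)
    # Rank voxel indices along the chosen main axis.
    if order == "x":
        ranked = sorted(range(n), key=lambda i: (voxels[i][0], voxels[i][1]))
    else:
        ranked = sorted(range(n), key=lambda i: (voxels[i][1], voxels[i][0]))
    # Spread the n ranked voxels evenly over num_sets sets.
    sets = [[] for _ in range(num_sets)]
    for rank, idx in enumerate(ranked):
        sets[rank * num_sets // n].append(idx)
    return sets


# Hypothetical window with five non-empty voxels.
window = [(0, 0), (0, 3), (1, 1), (2, 0), (3, 2)]
sets_x = partition_window(window, max_set_size=3, order="x")
sets_y = partition_window(window, max_set_size=3, order="y")
print(sets_x)
print(sets_y)
```

In this toy window the two orderings group the voxels differently, so a voxel shares a set with different neighbors in consecutive layers, which is the effect the rotated configurations are meant to achieve.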
Cite
Text
Wang et al. "DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01299
Markdown
[Wang et al. "DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/wang2023cvpr-dsvt/) doi:10.1109/CVPR52729.2023.01299
BibTeX
@inproceedings{wang2023cvpr-dsvt,
title = {{DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets}},
author = {Wang, Haiyang and Shi, Chen and Shi, Shaoshuai and Lei, Meng and Wang, Sen and He, Di and Schiele, Bernt and Wang, Liwei},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2023},
pages = {13520-13529},
doi = {10.1109/CVPR52729.2023.01299},
url = {https://mlanthology.org/cvpr/2023/wang2023cvpr-dsvt/}
}