An End-to-End Transformer Model for 3D Object Detection

Abstract

We propose 3DETR, an end-to-end Transformer-based object detection model for 3D point clouds. Compared to existing detection methods that employ a number of 3D-specific inductive biases, 3DETR requires minimal modifications to the vanilla Transformer block. Specifically, we find that a standard Transformer with non-parametric queries and Fourier positional embeddings is competitive with specialized architectures that employ libraries of 3D-specific operators with hand-tuned hyperparameters. Nevertheless, 3DETR is conceptually simple and easy to implement, enabling further improvements by incorporating 3D domain knowledge. Through extensive experiments, we show 3DETR outperforms the well-established and highly optimized VoteNet baselines on the challenging ScanNetV2 dataset by 9.5%. Furthermore, we show 3DETR is applicable to 3D tasks beyond detection, and can serve as a building block for future research.
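The abstract highlights Fourier positional embeddings for the point coordinates. As a rough illustration only (not the paper's exact formulation, and with all names below hypothetical), random Fourier features map each 3D point through a random projection followed by sine and cosine, giving the Transformer a high-frequency encoding of position:

```python
import numpy as np

def fourier_pos_embed(points, num_freqs=32, scale=1.0, seed=0):
    """Map (N, 3) xyz points to (N, 2 * num_freqs) Fourier features.

    A generic random-Fourier-feature sketch; the variant used in
    3DETR may differ in its projection matrix and scaling.
    """
    rng = np.random.default_rng(seed)
    # Random projection matrix B with shape (3, num_freqs).
    B = rng.normal(0.0, scale, size=(3, num_freqs))
    # Project points, then take sin/cos to get bounded periodic features.
    proj = 2.0 * np.pi * (points @ B)                      # (N, num_freqs)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

# Example: embed 5 random 3D points into a 64-dim positional code.
pts = np.random.rand(5, 3)
emb = fourier_pos_embed(pts)
print(emb.shape)  # (5, 64)
```

Because the features are built from sinusoids, nearby points receive similar embeddings at low frequencies while remaining distinguishable at high ones, which is the usual motivation for this family of encodings.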

Cite

Text

Misra et al. "An End-to-End Transformer Model for 3D Object Detection." International Conference on Computer Vision, 2021. doi:10.1109/ICCV48922.2021.00290

Markdown

[Misra et al. "An End-to-End Transformer Model for 3D Object Detection." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/misra2021iccv-endtoend/) doi:10.1109/ICCV48922.2021.00290

BibTeX

@inproceedings{misra2021iccv-endtoend,
  title     = {{An End-to-End Transformer Model for 3D Object Detection}},
  author    = {Misra, Ishan and Girdhar, Rohit and Joulin, Armand},
  booktitle = {International Conference on Computer Vision},
  year      = {2021},
  pages     = {2906--2917},
  doi       = {10.1109/ICCV48922.2021.00290},
  url       = {https://mlanthology.org/iccv/2021/misra2021iccv-endtoend/}
}