Cross Modal Transformer: Towards Fast and Robust 3D Object Detection

Abstract

In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. Spatial alignment of the multi-modal tokens is performed by encoding 3D points into the multi-modal features. The core design of CMT is quite simple, yet its performance is impressive: it achieves 74.1% NDS (state-of-the-art with a single model) on the nuScenes test set while maintaining fast inference speed. Moreover, CMT remains robust even if the LiDAR input is missing. Code is released at https://github.com/junjie18/CMT.
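The abstract's key mechanism — aligning image and point cloud tokens by encoding 3D points into position embeddings that are added to the multi-modal features — can be illustrated with a minimal sketch. All shapes, names, and the two-layer MLP below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    # Two-layer MLP with ReLU, used here as a point-wise position encoder.
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

# Hypothetical sizes: 6 tokens, 4 sampled 3D points per token, embed dim 8.
rng = np.random.default_rng(0)
num_tokens, num_points, embed_dim = 6, 4, 8

tokens = rng.normal(size=(num_tokens, embed_dim))       # modality tokens (image or LiDAR)
points = rng.uniform(size=(num_tokens, num_points, 3))  # 3D points associated with each token

# Flatten the sampled points per token and encode them into one embedding.
flat = points.reshape(num_tokens, num_points * 3)
w1 = rng.normal(size=(num_points * 3, 16)); b1 = np.zeros(16)
w2 = rng.normal(size=(16, embed_dim));      b2 = np.zeros(embed_dim)

pos_embed = mlp(flat, w1, b1, w2, b2)
aligned_tokens = tokens + pos_embed  # position-aware tokens for the transformer decoder
print(aligned_tokens.shape)          # (6, 8)
```

Because both modalities receive embeddings derived from the same 3D coordinate frame, the decoder can attend across them without an explicit view transformation — the intuition behind the design, under the simplifying assumptions above.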

Cite

Text

Yan et al. "Cross Modal Transformer: Towards Fast and Robust 3D Object Detection." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01675

Markdown

[Yan et al. "Cross Modal Transformer: Towards Fast and Robust 3D Object Detection." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/yan2023iccv-cross/) doi:10.1109/ICCV51070.2023.01675

BibTeX

@inproceedings{yan2023iccv-cross,
  title     = {{Cross Modal Transformer: Towards Fast and Robust 3D Object Detection}},
  author    = {Yan, Junjie and Liu, Yingfei and Sun, Jianjian and Jia, Fan and Li, Shuailin and Wang, Tiancai and Zhang, Xiangyu},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {18268--18278},
  doi       = {10.1109/ICCV51070.2023.01675},
  url       = {https://mlanthology.org/iccv/2023/yan2023iccv-cross/}
}