Cross Modal Transformer: Towards Fast and Robust 3D Object Detection
Abstract
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. Spatial alignment of the multi-modal tokens is performed by encoding 3D points into the multi-modal features. The core design of CMT is quite simple, yet its performance is impressive: it achieves 74.1% NDS (state of the art with a single model) on the nuScenes test set while maintaining fast inference speed. Moreover, CMT remains strongly robust even if the LiDAR input is missing. Code is released at https://github.com/junjie18/CMT.
Cite
Text
Yan et al. "Cross Modal Transformer: Towards Fast and Robust 3D Object Detection." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01675

Markdown
[Yan et al. "Cross Modal Transformer: Towards Fast and Robust 3D Object Detection." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/yan2023iccv-cross/) doi:10.1109/ICCV51070.2023.01675

BibTeX
@inproceedings{yan2023iccv-cross,
title = {{Cross Modal Transformer: Towards Fast and Robust 3D Object Detection}},
author = {Yan, Junjie and Liu, Yingfei and Sun, Jianjian and Jia, Fan and Li, Shuailin and Wang, Tiancai and Zhang, Xiangyu},
booktitle = {International Conference on Computer Vision},
year = {2023},
pages = {18268-18278},
doi = {10.1109/ICCV51070.2023.01675},
url = {https://mlanthology.org/iccv/2023/yan2023iccv-cross/}
}