MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer
Abstract
Monocular 3D object detection is an important yet challenging task in autonomous driving. Some existing methods leverage depth information from an off-the-shelf depth estimator to assist 3D detection, but they incur an additional computational burden and achieve limited performance because of inaccurate depth priors. To alleviate this, we propose MonoDTR, a novel end-to-end depth-aware transformer network for monocular 3D object detection. It mainly consists of two components: (1) the Depth-Aware Feature Enhancement (DFE) module, which implicitly learns depth-aware features with auxiliary supervision and without extra computation, and (2) the Depth-Aware Transformer (DTR) module, which globally integrates context- and depth-aware features. Moreover, unlike conventional pixel-wise positional encodings, we introduce a novel depth positional encoding (DPE) to inject depth positional hints into transformers. Our proposed depth-aware modules can be easily plugged into existing image-only monocular 3D object detectors to improve their performance. Extensive experiments on the KITTI dataset demonstrate that our approach outperforms previous state-of-the-art monocular-based methods and achieves real-time detection. Code is available at https://github.com/kuanchihhuang/MonoDTR.
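The depth positional encoding described above can be pictured as a per-pixel lookup: a predicted depth map is discretized into bins, and each bin indexes a learned embedding that is added to the transformer's input features. The sketch below is a minimal illustration of that idea only; the bin count, embedding width, maximum depth, and linear discretization are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class DepthPositionalEncoding(nn.Module):
    """Illustrative sketch of a depth positional encoding (DPE):
    quantize per-pixel depth into bins and look up a learned embedding.
    All hyperparameters here are hypothetical placeholders."""

    def __init__(self, num_bins: int = 96, embed_dim: int = 256,
                 max_depth: float = 80.0):
        super().__init__()
        self.num_bins = num_bins
        self.max_depth = max_depth
        self.embedding = nn.Embedding(num_bins, embed_dim)

    def forward(self, depth_map: torch.Tensor) -> torch.Tensor:
        # depth_map: (B, H, W), predicted metric depth per pixel.
        # Linearly discretize depth into integer bin indices.
        bins = (depth_map / self.max_depth * self.num_bins).long()
        bins = bins.clamp(0, self.num_bins - 1)
        # Look up embeddings and move channels first: (B, C, H, W),
        # ready to be added to a feature map fed into the transformer.
        return self.embedding(bins).permute(0, 3, 1, 2)


dpe = DepthPositionalEncoding()
depth = torch.rand(2, 24, 80) * 80.0  # dummy depth map in meters
pe = dpe(depth)
print(pe.shape)
```

Compared with a standard pixel-coordinate positional encoding, indexing by estimated depth lets attention distinguish pixels at similar image locations but different distances, which is the kind of depth hint the abstract refers to.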
Cite
Text
Huang et al. "MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.00398
Markdown
[Huang et al. "MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/huang2022cvpr-monodtr/) doi:10.1109/CVPR52688.2022.00398
BibTeX
@inproceedings{huang2022cvpr-monodtr,
title = {{MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer}},
author = {Huang, Kuan-Chih and Wu, Tsung-Han and Su, Hung-Ting and Hsu, Winston H.},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2022},
pages = {4012--4021},
doi = {10.1109/CVPR52688.2022.00398},
url = {https://mlanthology.org/cvpr/2022/huang2022cvpr-monodtr/}
}