UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation

Abstract

Jointly processing information from multiple sensors is crucial to achieving accurate and robust perception for reliable autonomous driving systems. However, current 3D perception research follows a modality-specific paradigm, which leads to additional computational overhead and inefficient collaboration between data from different sensors. In this paper, we present UniTR, an efficient multi-modal backbone for outdoor 3D perception that processes a variety of modalities with unified modeling and shared parameters. Unlike previous works, UniTR introduces a modality-agnostic transformer encoder that handles these view-discrepant sensor data, enabling parallel modal-wise representation learning and automatic cross-modal interaction without additional fusion steps. More importantly, to make full use of these complementary sensor types, we present a novel multi-modal integration strategy that considers both semantically rich 2D perspective relations and geometry-aware 3D sparse neighborhood relations. UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks. It sets a new state of the art on the nuScenes benchmark, improving 3D object detection by +1.1 NDS and BEV map segmentation by +12.0 mIoU, with lower inference latency. Code will be available at https://github.com/Haiyang-W/UniTR.

Cite

Text

Wang et al. "UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation." International Conference on Computer Vision, 2023.

Markdown

[Wang et al. "UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/wang2023iccv-unitr/)

BibTeX

@inproceedings{wang2023iccv-unitr,
  title     = {{UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation}},
  author    = {Wang, Haiyang and Tang, Hao and Shi, Shaoshuai and Li, Aoxue and Li, Zhenguo and Schiele, Bernt and Wang, Liwei},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {6792--6802},
  url       = {https://mlanthology.org/iccv/2023/wang2023iccv-unitr/}
}