LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection

Abstract

LiDAR and cameras are two common sensors for collecting sequential data for 3D object detection in autonomous driving. Although the complementary information across sensors and time has great potential to benefit 3D perception, taking full advantage of sequential cross-sensor data remains challenging. In this paper, we propose a novel LiDAR Image Fusion Transformer (LIFT) to model the mutual interaction of cross-sensor data over time. LIFT learns to align the input 4D sequential cross-sensor data to achieve multi-frame, multi-modal information aggregation. To alleviate the computational load, we project both point clouds and images into bird's-eye-view maps and compute sparse grid-wise self-attention. LIFT also benefits from a cross-sensor and cross-time data augmentation scheme. We evaluate the proposed approach on the challenging nuScenes and Waymo datasets, where LIFT outperforms the state of the art and strong baselines.
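To make the sparse grid-wise self-attention idea concrete, below is a minimal sketch in PyTorch. It assumes LiDAR and image features have already been projected into a shared bird's-eye-view grid; the class name SparseBEVSelfAttention, the occupancy-mask interface, and all shapes are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class SparseBEVSelfAttention(nn.Module):
    # Self-attention over non-empty BEV grid cells only (illustrative sketch,
    # not the authors' code). Assumes LiDAR and image features were already
    # fused into a single (B, C, H, W) bird's-eye-view tensor.
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, bev, occupancy):
        # bev: (B, C, H, W) fused BEV features
        # occupancy: (B, H, W) boolean mask marking non-empty grid cells
        B, C, H, W = bev.shape
        out = bev.clone()
        for b in range(B):  # per-sample loop kept simple; real code would pad and batch
            idx = occupancy[b].reshape(-1).nonzero(as_tuple=True)[0]
            if idx.numel() == 0:
                continue
            # Gather occupied cells as attention tokens: (1, N, C)
            tokens = bev[b].reshape(C, -1).t()[idx].unsqueeze(0)
            attended, _ = self.attn(tokens, tokens, tokens)
            # Scatter the attended tokens back into the BEV grid
            flat = out[b].reshape(C, -1)
            flat[:, idx] = attended.squeeze(0).t()
            out[b] = flat.view(C, H, W)
        return out

# Toy usage with random inputs and hypothetical sizes.
layer = SparseBEVSelfAttention(channels=64, num_heads=4)
bev = torch.randn(2, 64, 128, 128)
occupancy = torch.rand(2, 128, 128) > 0.9   # roughly 10% of cells occupied
fused = layer(bev, occupancy)               # (2, 64, 128, 128)

Restricting attention to occupied cells keeps the token count, and hence the quadratic attention cost, proportional to the scene content rather than the full H x W grid, which matches the abstract's stated motivation of alleviating computational load.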

Cite

Text

Zeng et al. "LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01666

Markdown

[Zeng et al. "LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/zeng2022cvpr-lift/) doi:10.1109/CVPR52688.2022.01666

BibTeX

@inproceedings{zeng2022cvpr-lift,
  title     = {{LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection}},
  author    = {Zeng, Yihan and Zhang, Da and Wang, Chunwei and Miao, Zhenwei and Liu, Ting and Zhan, Xin and Hao, Dayang and Ma, Chao},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {17172--17181},
  doi       = {10.1109/CVPR52688.2022.01666},
  url       = {https://mlanthology.org/cvpr/2022/zeng2022cvpr-lift/}
}