Structural Information Guided Multimodal Pre-Training for Vehicle-Centric Perception
Abstract
Understanding vehicles in images is important for applications such as intelligent transportation and self-driving systems. Existing vehicle-centric works typically pre-train models on large-scale classification datasets and then fine-tune them for specific downstream tasks. However, they neglect the specific characteristics of vehicle perception across different tasks, which can lead to sub-optimal performance. To address this issue, we propose a novel vehicle-centric pre-training framework called VehicleMAE, which incorporates structural information, including the spatial structure from vehicle profile information and the semantic structure from informative high-level natural language descriptions, for effective masked vehicle appearance reconstruction. Specifically, we explicitly extract the sketch lines of vehicles as a form of spatial structure to guide vehicle reconstruction. We further distill more comprehensive knowledge from the large CLIP model, based on the similarity between paired/unpaired vehicle image-text samples, to achieve a better understanding of vehicles. We build a large-scale dataset, termed Autobot1M, to pre-train our model; it contains about 1M vehicle images and 12,693 text descriptions. Extensive experiments on four vehicle-based downstream tasks fully validate the effectiveness of our VehicleMAE. The source code and pre-trained models will be released at https://github.com/Event-AHU/VehicleMAE.
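The abstract describes three training signals: masked appearance reconstruction, sketch-line (spatial-structure) guidance, and CLIP-based semantic alignment over paired/unpaired image-text samples. The following NumPy sketch illustrates how such a combined objective could be assembled; the function name, loss weights, and the exact loss forms (MSE for reconstruction, an InfoNCE-style term for alignment) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def vehiclemae_style_loss(pred_patches, target_patches, pred_sketch, target_sketch,
                          img_emb, txt_emb, mask, w_sketch=1.0, w_distill=1.0):
    """Illustrative VehicleMAE-style objective (all names/weights assumed):
    masked pixel reconstruction + sketch reconstruction + image-text alignment."""
    mask = np.asarray(mask, dtype=bool)
    # MAE-style reconstruction: MSE restricted to the masked patches only.
    l_pix = np.mean((pred_patches[mask] - target_patches[mask]) ** 2)
    # Spatial structure: reconstruct the vehicle's sketch/edge map.
    l_sketch = np.mean((pred_sketch - target_sketch) ** 2)
    # Semantic structure: pull each image embedding toward its paired text
    # embedding via a softmax over cosine similarities (InfoNCE-like); the
    # diagonal of the similarity matrix holds the paired samples.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T
    log_probs = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
    l_align = -np.mean(np.diag(log_probs))
    return l_pix + w_sketch * l_sketch + w_distill * l_align
```

Under this sketch, perfect reconstruction drives the first two terms to zero, while the alignment term remains a contrastive penalty over the batch.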
Cite
Text
Wang et al. "Structural Information Guided Multimodal Pre-Training for Vehicle-Centric Perception." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I6.28373
Markdown
[Wang et al. "Structural Information Guided Multimodal Pre-Training for Vehicle-Centric Perception." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/wang2024aaai-structural/) doi:10.1609/AAAI.V38I6.28373
BibTeX
@inproceedings{wang2024aaai-structural,
title = {{Structural Information Guided Multimodal Pre-Training for Vehicle-Centric Perception}},
author = {Wang, Xiao and Wu, Wentao and Li, Chenglong and Zhao, Zhicheng and Chen, Zhe and Shi, Yukai and Tang, Jin},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2024},
  pages = {5624--5632},
doi = {10.1609/AAAI.V38I6.28373},
url = {https://mlanthology.org/aaai/2024/wang2024aaai-structural/}
}