A Unified Multi-Modal Structure for Retrieving Tracked Vehicles Through Natural Language Descriptions

Abstract

Multi-modal and contrastive learning have driven substantial progress in image and video retrieval in recent years. Fusing text, image, and video knowledge opens up considerable opportunities for multi-dimensional, multi-view retrieval, especially in traffic scenes. This paper proposes a novel Multimodal Language Vehicle Retrieval (MLVR) system for retrieving the trajectories of tracked vehicles from natural language descriptions. The MLVR system consists mainly of an end-to-end text-video contrastive learning model, a CLIP few-shot domain adaptation method, and a semi-centralized control optimization system. By jointly modeling vehicle type, color, maneuver, and surrounding environment, MLVR provides a robust way to identify the matching trajectory for a given natural language description. With this structure, our approach achieved a Mean Reciprocal Rank (MRR) of 81.79% on the test dataset of the 7th AI City Challenge Track 2, Tracked-Vehicle Retrieval by Natural Language Descriptions, ranking 2nd on the public leaderboard. Our code is available at https://github.com/eadst/MLVR.
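
For readers unfamiliar with the reported metric, Mean Reciprocal Rank averages, over all queries, the reciprocal of the rank at which the correct item appears: MRR = (1/|Q|) * sum over queries of 1/rank. The following is a minimal sketch of that standard definition, not the challenge's official evaluation code. The function name, the dictionary layout, and the toy query and track identifiers are assumptions made for illustration.

```python
from typing import Dict, Hashable, Sequence


def mean_reciprocal_rank(
    rankings: Dict[Hashable, Sequence[Hashable]],
    targets: Dict[Hashable, Hashable],
) -> float:
    """Average of 1/rank of the correct track over all queries.

    rankings: hypothetical mapping from query id to track ids, best match first.
    targets:  hypothetical mapping from query id to the single correct track id.
    A query whose target is missing from its ranking contributes 0.
    """
    if not rankings:
        raise ValueError("rankings must contain at least one query")

    total = 0.0
    for query_id, ranked_tracks in rankings.items():
        target = targets[query_id]
        for position, track_id in enumerate(ranked_tracks, start=1):
            if track_id == target:
                total += 1.0 / position
                break
    return total / len(rankings)


# Toy data: the correct track is ranked 1st, 2nd, and 3rd respectively.
toy_rankings = {
    "q1": ["t_a", "t_b", "t_c"],
    "q2": ["t_a", "t_b", "t_c"],
    "q3": ["t_a", "t_b", "t_c"],
}
toy_targets = {"q1": "t_a", "q2": "t_b", "q3": "t_c"}

mrr = mean_reciprocal_rank(toy_rankings, toy_targets)
print(f"MRR = {mrr:.4f}")
```

In this toy case the reciprocal ranks are 1, 1/2, and 1/3, so the mean is 11/18. An MRR of 1.0 would mean every description retrieved its correct track at rank 1.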

Cite

Text

Xie et al. "A Unified Multi-Modal Structure for Retrieving Tracked Vehicles Through Natural Language Descriptions." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. doi:10.1109/CVPRW59228.2023.00572

Markdown

[Xie et al. "A Unified Multi-Modal Structure for Retrieving Tracked Vehicles Through Natural Language Descriptions." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023.](https://mlanthology.org/cvprw/2023/xie2023cvprw-unified/) doi:10.1109/CVPRW59228.2023.00572

BibTeX

@inproceedings{xie2023cvprw-unified,
  title     = {{A Unified Multi-Modal Structure for Retrieving Tracked Vehicles Through Natural Language Descriptions}},
  author    = {Xie, Dong and Liu, Linhu and Zhang, Shengjun and Tian, Jiang},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2023},
  pages     = {5419--5427},
  doi       = {10.1109/CVPRW59228.2023.00572},
  url       = {https://mlanthology.org/cvprw/2023/xie2023cvprw-unified/}
}