TrajPrompt: Aligning Color Trajectory with Vision-Language Representations

Abstract

Cross-modal learning shows promising potential to overcome the limitations of single-modality tasks. However, without proper design for representation alignment between different data sources, the external modality cannot fully exhibit its value. For example, recent trajectory prediction approaches incorporate the Bird’s-Eye-View (BEV) scene as an additional source but do not significantly improve performance compared to single-source strategies, indicating that the BEV scene and trajectory representations are not effectively combined. To overcome this problem, we propose TrajPrompt, a prompt-based approach that seamlessly incorporates trajectory representation into the vision-language framework CLIP for BEV scene understanding and future forecasting. We discover that CLIP can attend to the local area of the BEV scene by utilizing our innovative design of text prompts and colored lines. Comprehensive results demonstrate TrajPrompt’s effectiveness: it outperforms state-of-the-art trajectory predictors by a significant margin (over 35% improvement in the ADE and FDE metrics on the SDD and DroneCrowd datasets) while using fewer learnable parameters than previous trajectory modeling approaches that include scene information. Project page: https://trajprompt.github.io/
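The abstract reports its gains in ADE and FDE without defining them. As a minimal sketch, assuming the standard definitions used in trajectory forecasting (ADE is the mean Euclidean distance between predicted and ground-truth positions over all future time steps, FDE is that distance at the final step only), the two metrics for a single trajectory could be computed as below. The function name `displacement_errors` and the sample coordinates are hypothetical and not taken from the paper; the unit of the result (pixels or meters) depends on how a given dataset is annotated.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def displacement_errors(pred: List[Point], truth: List[Point]) -> Tuple[float, float]:
    """Return (ADE, FDE) for one predicted trajectory against its ground truth.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Euclidean distance between prediction and ground truth at each time step.
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    ade = sum(dists) / len(dists)  # average over all future time steps
    fde = dists[-1]                # error at the final time step only
    return ade, fde


# Made-up example: the prediction drifts away from the true path over time.
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(pred, truth)
print(f"ADE={ade:.2f}, FDE={fde:.2f}")
```

Under these definitions, a 35% improvement means the reported error is at most 0.65 times the baseline error, since lower is better for both metrics.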

Cite

Text

Tsao et al. "TrajPrompt: Aligning Color Trajectory with Vision-Language Representations." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72940-9_16

Markdown

[Tsao et al. "TrajPrompt: Aligning Color Trajectory with Vision-Language Representations." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/tsao2024eccv-trajprompt/) doi:10.1007/978-3-031-72940-9_16

BibTeX

@inproceedings{tsao2024eccv-trajprompt,
  title     = {{TrajPrompt: Aligning Color Trajectory with Vision-Language Representations}},
  author    = {Tsao, Li-Wu and Tsui, Hao-Tang and Tuan, Yu-Rou and Chen, Pei-Chi and Wang, Kuan-Lin and Wu, Jhih-Ciang and Shuai, Hong-Han and Cheng, Wen-Huang},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72940-9_16},
  url       = {https://mlanthology.org/eccv/2024/tsao2024eccv-trajprompt/}
}