LLaMAPed: Multi-Modal Pedestrian Crossing Intention Prediction

Abstract

A crucial technology in autonomous driving is the ability to predict whether a pedestrian will cross the road in the near future, allowing autonomous vehicles to respond accordingly. Traditional methods have employed visual networks to predict pedestrian crossing intentions. However, these methods depend on the datasets they were trained on, which limits their generalization to previously unseen driving scenarios. The advent of Multimodal Large Language Models (MLLMs), proficient in processing and reasoning over both text and images, offers a promising new approach to overcoming these challenges. In this paper, we propose LLaMAPed, the first study to apply the open-source MLLM VideoLLaMA2 to predicting pedestrian crossing intentions. We evaluated our method on the widely used JAAD dataset for pedestrian behavior prediction and compared its performance to traditional visual benchmark models and the closed-source GPT-4V. VideoLLaMA2, designed to enhance spatial-temporal modeling, is used in LLaMAPed to predict pedestrian crossing intentions in a zero-shot manner. LLaMAPed achieved a prediction accuracy of 58%, 1% higher than that of the commercial, closed-source GPT-4V. We used multiple features as input to LLaMAPed and quantitatively analyzed how prediction performance correlates with the number of observation frames and the pedestrian ratio. In addition, we qualitatively demonstrated the prediction of pedestrian behavior in various urban scenarios.
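The zero-shot evaluation described above can be sketched as a simple loop: prompt the video MLLM once per clip, map its free-text answer to a binary crossing label, and score against the dataset annotations. The prompt wording, the `query_fn` inference callable, and the answer parsing below are illustrative assumptions, not the paper's exact protocol.

```python
# Hypothetical sketch of zero-shot crossing-intention evaluation.
# `query_fn` stands in for an actual VideoLLaMA2 inference call;
# its signature and the prompt text are assumptions for illustration.

PROMPT = (
    "You are watching a driving scene. Will the highlighted pedestrian "
    "cross the road in the near future? Answer 'yes' or 'no'."
)

def parse_intention(answer: str) -> bool:
    """Map the model's free-text answer to a binary crossing label."""
    return answer.strip().lower().startswith("yes")

def evaluate(clips, labels, query_fn):
    """Zero-shot accuracy: one prompt per clip, no fine-tuning."""
    correct = sum(
        parse_intention(query_fn(clip, PROMPT)) == label
        for clip, label in zip(clips, labels)
    )
    return correct / len(labels)
```

In a zero-shot setup like this, all task-specific behavior lives in the prompt and the answer parser; the underlying model weights are never updated on JAAD.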

Cite

Text

Ham et al. "LLaMAPed: Multi-Modal Pedestrian Crossing Intention Prediction." European Conference on Computer Vision Workshops, 2024. doi:10.1007/978-3-031-91813-1_10

Markdown

[Ham et al. "LLaMAPed: Multi-Modal Pedestrian Crossing Intention Prediction." European Conference on Computer Vision Workshops, 2024.](https://mlanthology.org/eccvw/2024/ham2024eccvw-llamaped/) doi:10.1007/978-3-031-91813-1_10

BibTeX

@inproceedings{ham2024eccvw-llamaped,
  title     = {{LLaMAPed: Multi-Modal Pedestrian Crossing Intention Prediction}},
  author    = {Ham, Je-Seok and Kim, Sunghun and Huang, Jia and Jiang, Peng and Moon, Jinyoung and Saripalli, Srikanth and Kim, Changick},
  booktitle = {European Conference on Computer Vision Workshops},
  year      = {2024},
  pages     = {150--167},
  doi       = {10.1007/978-3-031-91813-1_10},
  url       = {https://mlanthology.org/eccvw/2024/ham2024eccvw-llamaped/}
}