Multi-Perspective Traffic Video Description Model with Fine-Grained Refinement Approach

Abstract

Analyzing traffic patterns is crucial for enhancing safety and optimizing traffic flow in urban areas. While cities possess extensive camera networks for monitoring, the raw video data often lacks the contextual detail necessary for understanding complex traffic incidents and the behaviors of road users. In this paper, we propose a novel methodology for generating comprehensive descriptions of traffic scenarios, combining a vision-language model with rule-based refinements to capture pertinent pedestrian, vehicle, and environmental factors. First, a captioning model generates a general description from the processed video input. This description is then refined sequentially through three primary modules: pedestrian-aware, vehicle-aware, and context-aware, to produce the final description. We evaluate our method on the Woven Traffic Safety dataset in Track 2 of the AI City Challenge 2024, obtaining competitive results with an S2 score of 22.6721. Code will be available at https://github.com/ToTuanAn/AICityChallenge2024_Track2
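The abstract describes a sequential pipeline: a vision-language captioning model produces a base description, which is then passed through pedestrian-aware, vehicle-aware, and context-aware refinement modules. The sketch below illustrates that control flow only; all function and class names are hypothetical placeholders, not the authors' API (the actual implementation is in the linked repository), and the rule-based refinements are reduced to simple string edits for illustration.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SceneInfo:
    """Illustrative container for cues extracted from the processed video."""
    pedestrian_cues: str
    vehicle_cues: str
    context_cues: str


def caption_video(video_frames: List[object]) -> str:
    """Stand-in for the vision-language captioning model that yields a general description."""
    return "A pedestrian is near a vehicle on an urban road."


def pedestrian_aware_refine(description: str, info: SceneInfo) -> str:
    # Hypothetical rule-based refinement appending pedestrian-related attributes.
    return f"{description} Pedestrian: {info.pedestrian_cues}."


def vehicle_aware_refine(description: str, info: SceneInfo) -> str:
    # Hypothetical rule-based refinement appending vehicle-related attributes.
    return f"{description} Vehicle: {info.vehicle_cues}."


def context_aware_refine(description: str, info: SceneInfo) -> str:
    # Hypothetical rule-based refinement appending environment/context attributes.
    return f"{description} Context: {info.context_cues}."


def describe_traffic_scene(video_frames: List[object], info: SceneInfo) -> str:
    """Generate a base caption, then refine it sequentially through the three modules."""
    description = caption_video(video_frames)
    refiners: List[Callable[[str, SceneInfo], str]] = [
        pedestrian_aware_refine,
        vehicle_aware_refine,
        context_aware_refine,
    ]
    for refine in refiners:
        description = refine(description, info)
    return description


if __name__ == "__main__":
    info = SceneInfo(
        pedestrian_cues="adult crossing from the right, looking toward the vehicle",
        vehicle_cues="sedan moving slowly toward the crosswalk",
        context_cues="daytime, clear weather, two-lane urban street",
    )
    print(describe_traffic_scene(video_frames=[], info=info))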

Cite

Text

To et al. "Multi-Perspective Traffic Video Description Model with Fine-Grained Refinement Approach." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00701

Markdown

[To et al. "Multi-Perspective Traffic Video Description Model with Fine-Grained Refinement Approach." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/to2024cvprw-multiperspective/) doi:10.1109/CVPRW63382.2024.00701

BibTeX

@inproceedings{to2024cvprw-multiperspective,
  title     = {{Multi-Perspective Traffic Video Description Model with Fine-Grained Refinement Approach}},
  author    = {To, Tuan-An and Tran, Minh-Nam and Ho, Trong-Bao and Ha, Thien-Loc and Nguyen, Quang-Tan and Luong, Hoang-Chau and Cao, Thanh-Duy and Tran, Minh-Triet},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2024},
  pages     = {7075--7084},
  doi       = {10.1109/CVPRW63382.2024.00701},
  url       = {https://mlanthology.org/cvprw/2024/to2024cvprw-multiperspective/}
}