Harnessing Large Language Models for Training-Free Video Anomaly Detection

Abstract

Video anomaly detection (VAD) aims to temporally locate abnormal events in a video. Existing works mostly rely on training deep models to learn the distribution of normality with video-level supervision, with one-class supervision, or in an unsupervised setting. Training-based methods tend to be domain-specific and are therefore costly to deploy in practice, as any domain change requires new data collection and model training. In this paper, we radically depart from previous efforts and propose LAnguage-based VAD (LAVAD), a method that tackles VAD in a novel, training-free paradigm by exploiting the capabilities of pre-trained large language models (LLMs) and existing vision-language models (VLMs). We leverage VLM-based captioning models to generate textual descriptions for each frame of any test video. Given these textual scene descriptions, we then devise a prompting mechanism to unlock the capability of LLMs for temporal aggregation and anomaly score estimation, turning LLMs into an effective video anomaly detector. We further leverage modality-aligned VLMs and propose effective techniques based on cross-modal similarity for cleaning noisy captions and refining the LLM-based anomaly scores. We evaluate LAVAD on two large datasets featuring real-world surveillance scenarios (UCF-Crime and XD-Violence), showing that it outperforms both unsupervised and one-class methods without requiring any training or data collection.
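The pipeline described in the abstract (per-frame captioning, caption cleaning, LLM-based scoring, score refinement) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every function name is hypothetical, the callables `caption_fn`, `embed_image_fn`, `embed_text_fn`, and `llm_score_fn` stand in for a captioning model, the two encoders of a modality-aligned VLM, and a prompted LLM, and the nearest-caption and top-k averaging rules are only one plausible reading of "cross-modal similarity".

```python
from math import sqrt
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def clean_captions(frame_embs: List[Vector], captions: List[str],
                   caption_embs: List[Vector]) -> List[str]:
    """ASSUMED cleaning rule: give each frame the caption (from anywhere in
    the video) whose text embedding is closest to the frame's embedding."""
    cleaned = []
    for f in frame_embs:
        best = max(range(len(captions)),
                   key=lambda j: cosine(f, caption_embs[j]))
        cleaned.append(captions[best])
    return cleaned


def refine_scores(query_embs: List[Vector], key_embs: List[Vector],
                  scores: List[float], k: int) -> List[float]:
    """ASSUMED refinement rule: replace each frame's score with the mean
    score of the k frames whose key embeddings are most similar to it."""
    k = max(1, k)
    refined = []
    for q in query_embs:
        ranked = sorted(range(len(scores)),
                        key=lambda j: cosine(q, key_embs[j]), reverse=True)
        top = ranked[:k]
        refined.append(sum(scores[j] for j in top) / len(top))
    return refined


def detect(frames: list,
           caption_fn: Callable[[object], str],
           embed_image_fn: Callable[[object], Vector],
           embed_text_fn: Callable[[str], Vector],
           llm_score_fn: Callable[[str], float],
           k: int = 2) -> List[float]:
    """Training-free scoring loop: caption, clean, score with an LLM, refine.
    All four callables are placeholders supplied by the caller."""
    captions = [caption_fn(f) for f in frames]
    frame_embs = [embed_image_fn(f) for f in frames]
    caption_embs = [embed_text_fn(c) for c in captions]
    cleaned = clean_captions(frame_embs, captions, caption_embs)
    raw_scores = [llm_score_fn(c) for c in cleaned]  # one score per frame
    cleaned_embs = [embed_text_fn(c) for c in cleaned]
    return refine_scores(frame_embs, cleaned_embs, raw_scores, k)
```

In this sketch no component is trained or fine-tuned; all models are used as frozen callables, which is the sense in which the approach is training-free. The temporal aggregation step mentioned in the abstract is folded into `llm_score_fn` here for brevity.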

Cite

Text

Zanella et al. "Harnessing Large Language Models for Training-Free Video Anomaly Detection." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01753

Markdown

[Zanella et al. "Harnessing Large Language Models for Training-Free Video Anomaly Detection." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/zanella2024cvpr-harnessing/) doi:10.1109/CVPR52733.2024.01753

BibTeX

@inproceedings{zanella2024cvpr-harnessing,
  title     = {{Harnessing Large Language Models for Training-Free Video Anomaly Detection}},
  author    = {Zanella, Luca and Menapace, Willi and Mancini, Massimiliano and Wang, Yiming and Ricci, Elisa},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {18527--18536},
  doi       = {10.1109/CVPR52733.2024.01753},
  url       = {https://mlanthology.org/cvpr/2024/zanella2024cvpr-harnessing/}
}