SemVAD: Fusing Semantic and Vision Features for Weakly Supervised Video Anomaly Detection

Abstract

In recent years, vision-language models such as CLIP and VideoLLaMA have demonstrated the ability to express visual data as semantically rich textual representations, making them highly effective for downstream tasks. Given this cross-modal representation power, leveraging such models for video anomaly detection (VAD) holds significant promise. In this work, we introduce Semantic VAD (SemVAD), a novel methodology for weakly supervised video anomaly detection (wVAD) that fuses visual and semantic features obtained from pretrained vision-language models, specifically VideoLLaMA 3 and CLIP. Our approach improves both the performance and the explainability of anomaly detection. Additionally, we analyze the sensitivity of recent state-of-the-art models to randomness in training initialization and introduce a more comprehensive evaluation framework to assess their robustness to small changes in training, such as the random number generator seed. This framework aims to provide a more rigorous and holistic assessment of model performance, giving a deeper understanding of model reliability and reproducibility in wVAD.
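
As an illustration of the seed-sensitivity analysis mentioned in the abstract, one plausible way to report robustness is to train the same model under several random seeds and summarize the spread of a metric such as AUC. The sketch below is not the authors' evaluation framework: the function name `summarize_across_seeds`, the choice of statistics, and the placeholder scores are all assumptions made for illustration.

```python
import statistics


def summarize_across_seeds(score_by_seed):
    """Summarize one metric value per training seed (hypothetical helper).

    score_by_seed maps a seed (int) to the metric obtained with that seed,
    e.g. frame-level AUC. Returns the mean, sample standard deviation,
    minimum, maximum, and range across seeds.
    """
    values = list(score_by_seed.values())
    if len(values) < 2:
        raise ValueError("at least two seeds are needed to measure spread")
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


if __name__ == "__main__":
    # Placeholder values for illustration only; not results from the paper.
    placeholder_scores = {0: 0.80, 1: 0.90, 2: 0.85}
    for name, value in summarize_across_seeds(placeholder_scores).items():
        print(f"{name}: {value:.4f}")
```

Reporting the standard deviation and range alongside the mean, rather than a single best run, is one simple way to expose how much a model's score depends on the seed.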

Cite

Text

Karim and Yilmaz. "SemVAD: Fusing Semantic and Vision Features for Weakly Supervised Video Anomaly Detection." Transactions on Machine Learning Research, 2026.

Markdown

[Karim and Yilmaz. "SemVAD: Fusing Semantic and Vision Features for Weakly Supervised Video Anomaly Detection." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/karim2026tmlr-semvad/)

BibTeX

@article{karim2026tmlr-semvad,
  title     = {{SemVAD: Fusing Semantic and Vision Features for Weakly Supervised Video Anomaly Detection}},
  author    = {Karim, Hamza and Yilmaz, Yasin},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/karim2026tmlr-semvad/}
}