SemVAD: Fusing Semantic and Vision Features for Weakly Supervised Video Anomaly Detection
Abstract
In recent years, vision-language models such as CLIP and VideoLLaMA have demonstrated the ability to express visual data as semantically rich textual representations, making them highly effective for downstream tasks. Given this cross-modal representation power, leveraging such models for video anomaly detection (VAD) holds significant promise. In this work, we introduce Semantic VAD (SemVAD), a novel methodology for weakly supervised video anomaly detection (wVAD) that fuses visual and semantic features obtained from pretrained vision-language models, specifically VideoLLaMA 3 and CLIP. Our approach enhances both performance and explainability in anomaly detection. Additionally, we analyze the sensitivity of recent state-of-the-art models to randomness in training initialization and introduce a more comprehensive evaluation framework that assesses their robustness to small changes in training, such as the random number generator seed. This framework aims to provide a more rigorous and holistic assessment of model performance, giving a deeper understanding of reliability and reproducibility in wVAD.
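As a rough illustration of the seed-sensitivity idea mentioned in the abstract, the sketch below repeats a training and evaluation routine over several seeds and reports the spread of the resulting scores. It is not the paper's evaluation framework, whose details the abstract does not give. The names `summarize_over_seeds` and `toy_train_and_eval` are hypothetical, and the toy routine returns synthetic numbers rather than real wVAD results.

```python
import random
import statistics


def summarize_over_seeds(train_and_eval, seeds):
    """Call train_and_eval(seed) once per seed and summarize the returned scores."""
    scores = [train_and_eval(seed) for seed in seeds]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        # Sample standard deviation needs at least two runs.
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


def toy_train_and_eval(seed):
    # Hypothetical stand-in (assumption): a real routine would seed all RNGs,
    # train a wVAD model, and return a test metric such as AUC.
    # Here we only draw a synthetic score so the sketch runs on its own.
    rng = random.Random(seed)
    return 0.85 + rng.uniform(-0.02, 0.02)


summary = summarize_over_seeds(toy_train_and_eval, seeds=[0, 1, 2, 3, 4])
print(f"mean={summary['mean']:.4f} std={summary['std']:.4f} "
      f"range=[{summary['min']:.4f}, {summary['max']:.4f}]")
```

Reporting the mean together with the standard deviation and range, rather than a single best run, is one simple way to show how much a reported score depends on the seed.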
Cite
Text
Karim and Yilmaz. "SemVAD: Fusing Semantic and Vision Features for Weakly Supervised Video Anomaly Detection." Transactions on Machine Learning Research, 2026.
Markdown
[Karim and Yilmaz. "SemVAD: Fusing Semantic and Vision Features for Weakly Supervised Video Anomaly Detection." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/karim2026tmlr-semvad/)
BibTeX
@article{karim2026tmlr-semvad,
title = {{SemVAD: Fusing Semantic and Vision Features for Weakly Supervised Video Anomaly Detection}},
author = {Karim, Hamza and Yilmaz, Yasin},
journal = {Transactions on Machine Learning Research},
year = {2026},
url = {https://mlanthology.org/tmlr/2026/karim2026tmlr-semvad/}
}