Ex-VAD: Explainable Fine-Grained Video Anomaly Detection Based on Visual-Language Models
Abstract
With advancements in visual-language models (VLMs) and large language models (LLMs), video anomaly detection (VAD) has progressed beyond binary classification to fine-grained categorization and multidimensional analysis. However, existing methods focus mainly on coarse-grained detection and lack explanations of anomalies. To address these challenges, we propose Ex-VAD, an Explainable Fine-grained Video Anomaly Detection approach that combines fine-grained classification with detailed explanations of anomalies. First, we use a VLM to extract frame-level captions, and an LLM converts them into video-level explanations, enhancing the model's explainability. Second, integrating textual explanations of anomalies with visual information greatly enhances the model's anomaly detection capability. Finally, we apply label-enhanced alignment to optimize feature fusion, enabling precise fine-grained detection. Extensive experimental results on the UCF-Crime and XD-Violence datasets demonstrate that Ex-VAD significantly outperforms existing state-of-the-art methods.
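The pipeline the abstract outlines (frame captions from a VLM, a video-level explanation from an LLM, then fusion of textual and visual features aligned against class labels) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the VLM and LLM are stubbed with placeholder functions, and the character-histogram "embedding" stands in for real text/visual encoders.

```python
import math

def vlm_frame_captions(frames):
    # Stub for a VLM captioner (hypothetical): one short caption per frame.
    return [f"a person {action}" for action in frames]

def llm_video_explanation(captions):
    # Stub for an LLM (hypothetical) that condenses frame-level captions
    # into a single video-level explanation of the clip.
    return "The video shows: " + "; ".join(captions) + "."

def embed(text, dim=8):
    # Toy text embedding: L2-normalized character histogram, standing in
    # for a real encoder's feature vector.
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

def classify(video_feat, label_names):
    # Toy analogue of label-enhanced alignment: score the fused video
    # feature against an embedding of each fine-grained class name and
    # return the best-matching label.
    scores = {lbl: cosine(video_feat, embed(lbl)) for lbl in label_names}
    return max(scores, key=scores.get), scores

# Usage: run the three stages end to end on dummy frame descriptions.
frames = ["walking", "running", "fighting"]
captions = vlm_frame_captions(frames)
explanation = llm_video_explanation(captions)      # video-level explanation
text_feat = embed(explanation)
visual_feat = embed(" ".join(frames))              # stand-in for visual features
fused = [(t + v) / 2 for t, v in zip(text_feat, visual_feat)]
label, scores = classify(fused, ["normal", "fighting", "abuse"])
```

In a real system, the fusion step and the label-alignment objective would be learned; here a simple average and cosine score are used only to make the data flow between the three stages concrete.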
Cite
Text
Huang et al. "Ex-VAD: Explainable Fine-Grained Video Anomaly Detection Based on Visual-Language Models." Proceedings of the 42nd International Conference on Machine Learning, 2025.
Markdown
[Huang et al. "Ex-VAD: Explainable Fine-Grained Video Anomaly Detection Based on Visual-Language Models." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/huang2025icml-exvad/)
BibTeX
@inproceedings{huang2025icml-exvad,
title = {{Ex-VAD: Explainable Fine-Grained Video Anomaly Detection Based on Visual-Language Models}},
author = {Huang, Chao and Shi, Yushu and Wen, Jie and Wang, Wei and Xu, Yong and Cao, Xiaochun},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {25750--25761},
volume = {267},
url = {https://mlanthology.org/icml/2025/huang2025icml-exvad/}
}