Just Dance with Pi! a Poly-Modal Inductor for Weakly-Supervised Video Anomaly Detection

Majhi, Snehashis; D'Amicantonio, Giacomo; Dantcheva, Antitza; Kong, Quan; Garattoni, Lorenzo; Francesca, Gianpiero; Bondarev, Egor; Bremond, Francois

doi:10.1109/CVPR52734.2025.02260

CVPR 2025 pp. 24265-24274

Abstract

Weakly-supervised methods for video anomaly detection (VAD) conventionally rely on RGB spatio-temporal features alone, which continues to limit their reliability in real-world scenarios. The reason is that RGB features are not distinctive enough to separate categories such as shoplifting from visually similar events. Robust VAD in complex real-world settings therefore requires augmenting RGB spatio-temporal features with additional modalities. Motivated by this, we introduce the Poly-modal Induced framework for VAD: PI-VAD (or π-VAD), a novel approach that augments RGB representations with five additional modalities. Specifically, the modalities capture fine-grained motion (pose), three-dimensional scene and entity representation (depth), surrounding objects (panoptic masks), global motion (optical flow), and language cues (VLM). Each modality represents an axis of a polygon, streamlined to add salient cues to RGB. π-VAD includes two plug-in modules, the Pseudo-modality Generation module and the Cross Modal Induction module, which generate modality-specific prototypical representations and thereby induce multi-modal information into RGB cues. These modules operate by performing anomaly-aware auxiliary tasks and require the five modality backbones only during training. Notably, π-VAD achieves state-of-the-art accuracy on three prominent VAD datasets encompassing real-world scenarios, without incurring the computational overhead of five modality backbones at inference.
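The training-only use of modality backbones described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the class `PseudoModalityGenerator`, the functions `alignment_loss`, `induce`, and `infer`, the linear heads, the mean-squared-error objective, and the residual averaging are all hypothetical stand-ins chosen for brevity, and the paper's anomaly-aware auxiliary tasks are not modeled. The sketch only shows the general pattern in which real modality features supervise pseudo-modality features during training, while inference consumes RGB features alone.

```python
import random

# Modality names taken from the abstract; everything else below is assumed.
MODALITIES = ["pose", "depth", "panoptic", "flow", "text"]


def make_weights(rng, out_dim, in_dim):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def linear(weights, x):
    """Matrix-vector product for plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class PseudoModalityGenerator:
    """Hypothetical generator: one linear head per modality, fed by RGB features."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.heads = {m: make_weights(rng, dim, dim) for m in MODALITIES}

    def __call__(self, rgb):
        return {m: linear(w, rgb) for m, w in self.heads.items()}


def alignment_loss(pseudo, teacher):
    """Training-only objective (assumed MSE): pull pseudo features toward the
    features produced by the real modality backbones."""
    total, count = 0.0, 0
    for m in pseudo:
        for p, t in zip(pseudo[m], teacher[m]):
            total += (p - t) ** 2
            count += 1
    return total / count


def induce(rgb, pseudo):
    """Assumed induction step: add the averaged pseudo-modality cues to RGB."""
    n = len(pseudo)
    return [r + sum(pseudo[m][i] for m in pseudo) / n for i, r in enumerate(rgb)]


def infer(generator, rgb):
    """Inference path: only RGB features are needed, no modality backbones."""
    return induce(rgb, generator(rgb))


generator = PseudoModalityGenerator(dim=4)
rgb_feature = [0.5, -0.2, 0.1, 0.9]
enriched = infer(generator, rgb_feature)
```

In this sketch, `alignment_loss` would be evaluated only while training, with `teacher` holding features from the five modality backbones; `infer` never touches them, which mirrors the abstract's claim about avoiding their cost at inference.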

Cite

Text

Majhi et al. "Just Dance with Pi! a Poly-Modal Inductor for Weakly-Supervised Video Anomaly Detection." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02260

Markdown

[Majhi et al. "Just Dance with Pi! a Poly-Modal Inductor for Weakly-Supervised Video Anomaly Detection." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/majhi2025cvpr-just/) doi:10.1109/CVPR52734.2025.02260

BibTeX

@inproceedings{majhi2025cvpr-just,
  title     = {{Just Dance with Pi! a Poly-Modal Inductor for Weakly-Supervised Video Anomaly Detection}},
  author    = {Majhi, Snehashis and D'Amicantonio, Giacomo and Dantcheva, Antitza and Kong, Quan and Garattoni, Lorenzo and Francesca, Gianpiero and Bondarev, Egor and Bremond, Francois},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {24265-24274},
  doi       = {10.1109/CVPR52734.2025.02260},
  url       = {https://mlanthology.org/cvpr/2025/majhi2025cvpr-just/}
}