Towards Training-Free Anomaly Detection with Vision and Language Foundation Models

Abstract

Anomaly detection is valuable for real-world applications, such as industrial quality inspection. However, most approaches focus on detecting local structural anomalies while neglecting compositional anomalies incorporating logical constraints. In this paper, we introduce LogSAD, a novel multi-modal framework that requires no training for both Logical and Structural Anomaly Detection. First, we propose a match-of-thought architecture that employs advanced large multi-modal models (e.g., GPT-4V) to generate matching proposals, formulating interests and compositional rules of thought for anomaly detection. Second, we elaborate on multi-granularity anomaly detection, consisting of patch tokens, sets of interests, and composition matching with vision and language foundation models. Subsequently, we present a calibration module to align anomaly scores from different detectors, followed by integration strategies for the final decision. Consequently, our approach addresses both logical and structural anomaly detection within a unified framework and achieves state-of-the-art results without the need for training, even when compared to supervised approaches, highlighting its robustness and effectiveness. Code is available at https://github.com/zhang0jhon/LogSAD.
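
The abstract does not say how the calibration module or the integration strategies are defined. As a rough illustration of the general idea only, the sketch below aligns scores from two detectors that live on different scales and then fuses them. It assumes z-score calibration against scores from known-normal reference images and a max/mean fusion rule; these choices, the detector names, and the numbers are all hypothetical and are not taken from LogSAD.

```python
from statistics import mean, pstdev


def calibrate(scores, reference_scores, eps=1e-8):
    """Rescale raw scores to z-scores using normal-reference statistics.

    Assumption: z-score calibration stands in for the paper's calibration
    module, whose actual form is not given in the abstract.
    """
    mu = mean(reference_scores)
    sigma = pstdev(reference_scores)
    return [(s - mu) / (sigma + eps) for s in scores]


def fuse(calibrated_by_detector, strategy="max"):
    """Combine per-image calibrated scores across detectors.

    Assumption: max and mean are illustrative integration strategies.
    """
    per_image = list(zip(*calibrated_by_detector.values()))
    if strategy == "max":
        return [max(t) for t in per_image]
    if strategy == "mean":
        return [mean(t) for t in per_image]
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical raw scores: the two detectors use very different scales.
reference = {
    "patch_detector": [0.1, 0.2, 0.3],
    "composition_detector": [10.0, 20.0, 30.0],
}
test_scores = {
    "patch_detector": [0.2, 0.5],           # second image looks structurally odd
    "composition_detector": [20.0, 25.0],
}

calibrated = {
    name: calibrate(test_scores[name], reference[name])
    for name in test_scores
}
fused = fuse(calibrated, strategy="max")
```

After calibration both detectors report scores in units of standard deviations from their own normal behaviour, so a max or mean across detectors is no longer dominated by whichever detector happens to use the larger raw scale.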

Cite

Text

Zhang et al. "Towards Training-Free Anomaly Detection with Vision and Language Foundation Models." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01416

Markdown

[Zhang et al. "Towards Training-Free Anomaly Detection with Vision and Language Foundation Models." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/zhang2025cvpr-trainingfree/) doi:10.1109/CVPR52734.2025.01416

BibTeX

@inproceedings{zhang2025cvpr-trainingfree,
  title     = {{Towards Training-Free Anomaly Detection with Vision and Language Foundation Models}},
  author    = {Zhang, Jinjin and Wang, Guodong and Jin, Yizhou and Huang, Di},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {15204--15213},
  doi       = {10.1109/CVPR52734.2025.01416},
  url       = {https://mlanthology.org/cvpr/2025/zhang2025cvpr-trainingfree/}
}