CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection
Abstract
Anomaly detection is a complex problem due to the ambiguity in defining anomalies, the diversity of anomaly types (e.g., local and global defects), and the scarcity of training data. As such, it necessitates a comprehensive model capable of capturing both low-level and high-level features, even with limited data. To address this, we propose CLIPFUSION, a method that leverages both discriminative and generative foundation models. Given the CLIP-based discriminative model's limited capacity to capture fine-grained local details, we incorporate a diffusion-based generative model to complement its features. This integration yields a synergistic solution for anomaly detection. To this end, we propose using diffusion models as feature extractors for anomaly detection, and introduce carefully designed strategies to extract informative cross-attention and feature maps. Experimental results on benchmark datasets (MVTec-AD, VisA) demonstrate that CLIPFUSION consistently outperforms baseline methods in both anomaly segmentation and classification under both zero-shot and few-shot settings. We believe that our method underscores the effectiveness of multi-modal and multi-model fusion in tackling the multifaceted challenges of anomaly detection, providing a scalable solution for real-world applications.
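The sketch below illustrates, in broad strokes, the kind of fusion the abstract describes: patch-level similarity scores from a CLIP-style image/text encoder are combined with multi-scale feature maps hooked out of a diffusion-style U-Net backbone to form a single anomaly map. This is not the authors' implementation; the toy backbone, the layer names, the feature-norm saliency proxy, and all tensor shapes are assumptions made purely for illustration.

```python
# Illustrative sketch (not the paper's code): fuse CLIP-style patch similarities
# with multi-scale feature maps taken from a diffusion-style U-Net backbone.
# Every module name, shape, and heuristic below is an assumption for this demo.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyUNetBackbone(nn.Module):
    """Stand-in for a diffusion U-Net; real code would hook a pretrained model."""

    def __init__(self):
        super().__init__()
        self.down1 = nn.Conv2d(3, 32, 3, stride=2, padding=1)
        self.down2 = nn.Conv2d(32, 64, 3, stride=2, padding=1)
        self.mid = nn.Conv2d(64, 64, 3, padding=1)

    def forward(self, x):
        x = F.relu(self.down1(x))
        x = F.relu(self.down2(x))
        return F.relu(self.mid(x))


def collect_feature_maps(model: nn.Module, image: torch.Tensor, layer_names):
    """Capture intermediate feature maps via forward hooks (generic PyTorch pattern)."""
    feats, handles = {}, []
    modules = dict(model.named_modules())
    for name in layer_names:
        handles.append(
            modules[name].register_forward_hook(
                lambda m, i, o, key=name: feats.__setitem__(key, o.detach())
            )
        )
    with torch.no_grad():
        model(image)
    for h in handles:
        h.remove()
    return feats


def clip_style_anomaly_map(patch_feats, text_feats, size):
    """Cosine similarity of patch embeddings to [normal, anomalous] text prototypes."""
    patch = F.normalize(patch_feats, dim=-1)          # (B, N, D)
    text = F.normalize(text_feats, dim=-1)            # (2, D)
    probs = (patch @ text.t()).softmax(dim=-1)[..., 1]  # anomaly probability per patch
    side = int(probs.shape[1] ** 0.5)
    amap = probs.view(-1, 1, side, side)
    return F.interpolate(amap, size=size, mode="bilinear", align_corners=False)


def fuse_maps(maps, size):
    """Upsample every candidate map to a common resolution and average."""
    ups = [F.interpolate(m, size=size, mode="bilinear", align_corners=False) for m in maps]
    return torch.stack(ups, dim=0).mean(dim=0)


if __name__ == "__main__":
    image = torch.randn(1, 3, 256, 256)
    patch_feats = torch.randn(1, 16 * 16, 512)        # placeholder CLIP patch tokens
    text_feats = torch.randn(2, 512)                  # placeholder text prototypes

    backbone = ToyUNetBackbone()
    unet_feats = collect_feature_maps(backbone, image, ["down2", "mid"])

    clip_map = clip_style_anomaly_map(patch_feats, text_feats, size=(256, 256))
    # Per-pixel feature norms as a crude saliency proxy for the generative branch.
    diff_maps = [f.norm(dim=1, keepdim=True) for f in unet_feats.values()]

    anomaly_map = fuse_maps([clip_map] + diff_maps, size=(256, 256))
    print(anomaly_map.shape)  # torch.Size([1, 1, 256, 256])
```

In practice the backbone and the patch/text embeddings would come from pretrained diffusion and CLIP models, and the paper's own extraction and fusion strategies (e.g., its use of cross-attention maps) are more involved than this averaging heuristic.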
Cite
Text:
Lee et al. "CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection." Transactions on Machine Learning Research, 2025.

Markdown:
[Lee et al. "CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/lee2025tmlr-clip/)

BibTeX:
@article{lee2025tmlr-clip,
  title   = {{CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection}},
  author  = {Lee, Byeongchan and Won, John and Lee, Seunghyun and Shin, Jinwoo},
  journal = {Transactions on Machine Learning Research},
  year    = {2025},
  url     = {https://mlanthology.org/tmlr/2025/lee2025tmlr-clip/}
}