MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models

Weng, Fenghua; Xu, Yue; Fu, Chengyan; Wang, Wenjie

doi:10.1609/AAAI.V39I26.34983

AAAI 2025 pp. 27689-27697

Abstract

As deep learning advances, Large Language Models (LLMs) and their multimodal counterparts, Vision-Language Models (VLMs), have shown exceptional performance in many real-world tasks. However, VLMs face significant security challenges, such as jailbreak attacks, in which attackers attempt to bypass the model’s safety alignment to elicit harmful responses. The threat of jailbreak attacks on VLMs arises from both the inherent vulnerabilities of LLMs and the multiple information channels that VLMs process. While various attacks and defenses have been proposed, there is a notable gap in unified and comprehensive evaluations: each method is evaluated on different datasets and metrics, making it impossible to compare the effectiveness of the methods. To address this gap, we introduce MMJ-Bench, a unified pipeline for evaluating jailbreak attacks and defense techniques for VLMs. Through extensive experiments, we assess the effectiveness of various attack methods against SoTA VLMs and evaluate the impact of defense mechanisms on both defense effectiveness and model utility for normal tasks. Our comprehensive evaluation contributes to the field by offering a unified and systematic evaluation framework and the first publicly available benchmark for VLM jailbreak research. We also present several insightful findings that highlight directions for future studies.

Cite

Text

Weng et al. "MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I26.34983

Markdown

[Weng et al. "MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/weng2025aaai-mmj/) doi:10.1609/AAAI.V39I26.34983

BibTeX

@inproceedings{weng2025aaai-mmj,
  title     = {{MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models}},
  author    = {Weng, Fenghua and Xu, Yue and Fu, Chengyan and Wang, Wenjie},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {27689-27697},
  doi       = {10.1609/AAAI.V39I26.34983},
  url       = {https://mlanthology.org/aaai/2025/weng2025aaai-mmj/}
}