Zer0-Jack: A Memory-Efficient Gradient-Based Jailbreaking Method for Black Box Multi-Modal Large Language Models

Chen, Tiejin; Wang, Kaishen; Wei, Hua

NeurIPSW 2024

Abstract

Jailbreaking methods, which induce Multi-modal Large Language Models (MLLMs) to output harmful responses, raise significant safety concerns. Among these methods, gradient-based approaches, which use gradients to generate malicious prompts, have been widely studied due to their high success rates in white-box settings, where full access to the model is available. However, these methods have two notable limitations: they require white-box access, which is not always feasible, and they use a large amount of memory. When white-box access is unavailable, attackers often resort to transfer attacks, in which malicious inputs generated on white-box models are applied to black-box models, but this typically reduces attack performance. To overcome these challenges, we propose Zer0-Jack, a method that bypasses the need for white-box access by leveraging zeroth-order optimization. We further propose patch coordinate descent, which efficiently generates malicious image inputs to attack black-box MLLMs directly and significantly reduces memory usage. In extensive experiments, Zer0-Jack achieves a high attack success rate across various models, surpassing previous transfer-based methods and performing comparably with existing white-box jailbreak techniques. Notably, Zer0-Jack achieves a 95% attack success rate on MiniGPT-4 with the Harmful Behaviors Multi-modal Dataset. Additionally, we show that Zer0-Jack can directly attack commercial MLLMs such as GPT-4o. Code is provided in the supplement.
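The abstract names two general ingredients, zeroth-order optimization and patch-wise coordinate updates, without giving algorithmic details. The sketch below illustrates only the generic idea on a toy quadratic objective: a two-point finite-difference estimate that needs loss values but no backpropagation, applied to one patch of a 2-D array at a time. It is not the authors' implementation; the function names (`patch_zo_descent`, `toy_loss`), the Gaussian perturbation, the patch size, the step size, and the objective are all assumptions chosen for illustration.

```python
import numpy as np


def patch_zo_descent(loss_fn, x, patch=2, steps=100, mu=1e-3, lr=0.05, seed=0):
    """Minimize a black-box loss over a 2-D array, one patch at a time.

    Hypothetical helper: only loss values are queried, never gradients.
    """
    rng = np.random.default_rng(seed)
    x = x.astype(float).copy()
    h, w = x.shape
    for _ in range(steps):
        for i in range(0, h, patch):
            for j in range(0, w, patch):
                # Random direction that is nonzero only inside the current patch.
                u = np.zeros_like(x)
                block = u[i:i + patch, j:j + patch]  # view into u
                block[...] = rng.standard_normal(block.shape)
                # Two-point estimate of the directional derivative along u.
                g = (loss_fn(x + mu * u) - loss_fn(x - mu * u)) / (2 * mu)
                # Descent step restricted to the patch (u is zero elsewhere).
                x -= lr * g * u
    return x


# Toy objective (an assumption): squared distance to a fixed target array.
target = np.full((8, 8), 0.5)


def toy_loss(img):
    return float(np.sum((img - target) ** 2))


x0 = np.zeros((8, 8))
x_opt = patch_zo_descent(toy_loss, x0)
```

Because each estimate uses two forward evaluations and no stored activations or gradients, the memory cost of this kind of update is that of evaluating the loss itself, which is the general reason zeroth-order methods are attractive when backpropagation is unavailable or expensive.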

Cite

Text

Chen et al. "Zer0-Jack: A Memory-Efficient Gradient-Based Jailbreaking Method for Black Box Multi-Modal Large Language Models." NeurIPS 2024 Workshops: SafeGenAi, 2024.

Markdown

[Chen et al. "Zer0-Jack: A Memory-Efficient Gradient-Based Jailbreaking Method for Black Box Multi-Modal Large Language Models." NeurIPS 2024 Workshops: SafeGenAi, 2024.](https://mlanthology.org/neuripsw/2024/chen2024neuripsw-zer0jack/)

BibTeX

@inproceedings{chen2024neuripsw-zer0jack,
  title     = {{Zer0-Jack: A Memory-Efficient Gradient-Based Jailbreaking Method for Black Box Multi-Modal Large Language Models}},
  author    = {Chen, Tiejin and Wang, Kaishen and Wei, Hua},
  booktitle = {NeurIPS 2024 Workshops: SafeGenAi},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/chen2024neuripsw-zer0jack/}
}