BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks

Abstract

In this paper, we focus on black-box defense for vision-language models (VLMs) against jailbreak attacks. Existing black-box defense methods are either unimodal or bimodal. Unimodal methods enhance either the vision or the language module of the VLM, while bimodal methods robustify the model through text-image representation realignment. However, these methods suffer from two limitations: 1) they fail to fully exploit cross-modal information, or 2) they degrade model performance on benign inputs. To address these limitations, we propose a novel blue-team method, BlueSuffix, that defends target VLMs against jailbreak attacks in a black-box setting without compromising their performance. BlueSuffix includes three key components: 1) a visual purifier against jailbreak images, 2) a textual purifier against jailbreak texts, and 3) a blue-team suffix generator, trained with reinforcement fine-tuning, that enhances cross-modal robustness. We empirically show on four VLMs (LLaVA, MiniGPT-4, InstructBLIP, and Gemini) and four safety benchmarks (Harmful Instruction, AdvBench, MM-SafetyBench, and RedTeam-2K) that BlueSuffix outperforms the baseline defenses by a significant margin. BlueSuffix opens up a promising direction for defending VLMs against jailbreak attacks. Code is available at https://github.com/Vinsonzyh/BlueSuffix.
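
The three components listed in the abstract can be read as a wrapper around a black-box model. The following is only a minimal sketch of that composition, not the authors' implementation: the class name `BlackBoxDefense`, its field names, and the stub callables are all hypothetical, and conditioning the suffix on the purified prompt and appending it to that prompt are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BlackBoxDefense:
    """Hypothetical wrapper: purify both inputs, append a suffix, query the VLM."""

    purify_image: Callable[[bytes], bytes]    # stands in for the visual purifier
    purify_text: Callable[[str], str]         # stands in for the textual purifier
    generate_suffix: Callable[[str], str]     # stands in for the blue-team suffix generator
    target_vlm: Callable[[bytes, str], str]   # black box: only inputs and outputs are visible

    def respond(self, image: bytes, prompt: str) -> str:
        clean_image = self.purify_image(image)
        clean_prompt = self.purify_text(prompt)
        # Assumption: the suffix depends on the purified prompt and is appended to it.
        suffix = self.generate_suffix(clean_prompt)
        return self.target_vlm(clean_image, f"{clean_prompt} {suffix}")


# Stub components, used only to show how the pieces connect.
demo = BlackBoxDefense(
    purify_image=lambda img: img,
    purify_text=lambda text: text.strip(),
    generate_suffix=lambda text: "[suffix]",
    target_vlm=lambda img, text: f"VLM saw {len(img)} bytes and prompt: {text}",
)
demo_output = demo.respond(b"abc", "  hello ")
```

Because the wrapper touches only the inputs passed to `target_vlm`, it needs no access to the model's weights or gradients, which is what the black-box setting requires.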

Cite

Text

Zhao et al. "BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks." International Conference on Learning Representations, 2025.

Markdown

[Zhao et al. "BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/zhao2025iclr-bluesuffix/)

BibTeX

@inproceedings{zhao2025iclr-bluesuffix,
  title     = {{BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks}},
  author    = {Zhao, Yunhan and Zheng, Xiang and Luo, Lin and Li, Yige and Ma, Xingjun and Jiang, Yu-Gang},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/zhao2025iclr-bluesuffix/}
}