Perception-Guided Jailbreak Against Text-to-Image Models

Abstract

In recent years, Text-to-Image (T2I) models have garnered significant attention for their remarkable advancements. However, their potential to generate inappropriate or Not-Safe-For-Work (NSFW) images has raised security concerns. In this paper, inspired by the observation that texts with different semantics can lead to similar human perceptions, we propose an LLM-driven perception-guided jailbreak method, termed PGJ. It is a black-box jailbreak method that requires no specific T2I model (model-free) and generates highly natural attack prompts. Specifically, we propose identifying a safe phrase that is similar in human perception yet inconsistent in text semantics with the target unsafe word, and using it as a substitution. Experiments conducted on six open-source models and commercial online services with thousands of prompts verify the effectiveness of PGJ.

Cite

Text

Huang et al. "Perception-Guided Jailbreak Against Text-to-Image Models." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I25.34821

Markdown

[Huang et al. "Perception-Guided Jailbreak Against Text-to-Image Models." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/huang2025aaai-perception/) doi:10.1609/AAAI.V39I25.34821

BibTeX

@inproceedings{huang2025aaai-perception,
  title     = {{Perception-Guided Jailbreak Against Text-to-Image Models}},
  author    = {Huang, Yihao and Liang, Le and Li, Tianlin and Jia, Xiaojun and Wang, Run and Miao, Weikai and Pu, Geguang and Liu, Yang},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {26238--26247},
  doi       = {10.1609/AAAI.V39I25.34821},
  url       = {https://mlanthology.org/aaai/2025/huang2025aaai-perception/}
}