AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts

Abstract

Despite rapid advances in text-to-image (T2I) models, their safety mechanisms remain vulnerable to adversarial prompts that maliciously induce the generation of unsafe images. Current red-teaming methods for proactively assessing such vulnerabilities usually require white-box access to T2I models, rely on inefficient per-prompt optimization, and inevitably generate semantically meaningless prompts that are easily blocked by filters. In this paper, we propose APT (AutoPrompT), a black-box framework that leverages large language models (LLMs) to automatically generate human-readable adversarial suffixes for benign prompts. We first introduce an alternating optimization-finetuning pipeline that alternates between optimizing adversarial suffixes and fine-tuning the LLM on the optimized suffixes. Furthermore, we integrate a dual-evasion strategy into the optimization phase, enabling the bypass of both perplexity-based filters and blacklist word filters: (1) we constrain the LLM to generate human-readable prompts through auxiliary LLM perplexity scoring, in stark contrast to the token-level gibberish of prior work, and (2) we introduce banned-token penalties to suppress the explicit generation of banned tokens in the blacklist. Extensive experiments demonstrate the strong red-teaming performance of our human-readable, filter-resistant adversarial prompts, as well as superior zero-shot transferability, which enables instant adaptation to unseen prompts and exposes critical vulnerabilities even in commercial APIs (e.g., Leonardo.Ai).

Cite

Text

Liu et al. "AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts." International Conference on Computer Vision, 2025.

Markdown

[Liu et al. "AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/liu2025iccv-autoprompt/)

BibTeX

@inproceedings{liu2025iccv-autoprompt,
  title     = {{AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts}},
  author    = {Liu, Yufan and Zhang, Wanqian and Chen, Huashan and Wang, Lin and Jia, Xiaojun and Lin, Zheng and Wang, Weiping},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {17557--17566},
  url       = {https://mlanthology.org/iccv/2025/liu2025iccv-autoprompt/}
}