Wayward Concepts in Multimodal Models

Abstract

Large multimodal models such as Stable Diffusion can generate, detect, and classify new visual concepts after optimizing just the prompt. How do prompt embeddings for visual concepts found by prompt tuning methods differ from typical discrete prompts? We conduct a large-scale analysis on three state-of-the-art models in text-to-image generation, open-set object detection, and zero-shot classification, and find that prompts optimized to represent new visual concepts are akin to an adversarial attack on the text encoder. Across 4,800 new embeddings trained for 40 diverse visual concepts on four standard datasets, we find perturbations within an $\epsilon$-ball of any prompt that reprogram models to generate, detect, and classify arbitrary subjects. These perturbations target the final layers of text encoders and steer pooling tokens toward the subject. We explore the transferability of these prompts and find that perturbations reprogramming multimodal models are both initialization-specific and model-specific. Code for reproducing our work is available at: https://wayward-concepts.github.io.
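The ε-ball finding above can be illustrated with a minimal sketch of projected gradient descent: an embedding is optimized toward a target while being projected back into an ε-ball around its initialization. This is a toy NumPy illustration under assumed quantities (a 16-dimensional embedding, a hypothetical "subject" vector, and a simple squared-distance loss), not the paper's actual training setup or models.

```python
import numpy as np

def optimize_prompt_embedding(init, loss_grad, epsilon=0.5, lr=0.1, steps=100):
    """Projected gradient descent on a prompt embedding, constrained
    to stay within an epsilon-ball of its initialization.
    Illustrative sketch only; not the paper's implementation."""
    emb = init.copy()
    for _ in range(steps):
        emb = emb - lr * loss_grad(emb)
        # Project back onto the epsilon-ball around the initialization.
        delta = emb - init
        norm = np.linalg.norm(delta)
        if norm > epsilon:
            emb = init + delta * (epsilon / norm)
    return emb

# Toy example: pull the embedding toward a hypothetical "subject" vector.
rng = np.random.default_rng(0)
init = rng.normal(size=16)       # stand-in for a discrete prompt's embedding
subject = rng.normal(size=16)    # stand-in for the target concept
grad = lambda e: 2.0 * (e - subject)  # gradient of ||e - subject||^2
emb = optimize_prompt_embedding(init, grad, epsilon=0.5)
```

In this toy setting, the optimized embedding ends up on the boundary of the ε-ball while still moving measurably closer to the target, mirroring the abstract's observation that small perturbations to any prompt suffice to redirect the model toward an arbitrary subject.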

Cite

Text

Trabucco et al. "Wayward Concepts in Multimodal Models." International Conference on Learning Representations, 2025.

Markdown

[Trabucco et al. "Wayward Concepts in Multimodal Models." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/trabucco2025iclr-wayward/)

BibTeX

@inproceedings{trabucco2025iclr-wayward,
  title     = {{Wayward Concepts in Multimodal Models}},
  author    = {Trabucco, Brandon and Gurinas, Max A and Doherty, Kyle and Salakhutdinov, Russ},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/trabucco2025iclr-wayward/}
}