Misaligned Roles, Misplaced Images: Structural Input Perturbations Expose Multimodal Alignment Blind Spots

Abstract

Multimodal Language Models (MMLMs) typically undergo post-training alignment to prevent harmful content generation. However, these alignment stages focus primarily on the *assistant* role, leaving the *user* role unaligned, and sticking to a fixed input prompt structure of special tokens, making the model vulnerable when inputs deviate from these expectations. We introduce Role-Modality Attacks (RMA), a novel class of adversarial attacks that exploit role confusion between the *user* and *assistant* and alter the position of the *image* token to elicit harmful outputs. Unlike existing attacks that modify query content, RMAs manipulate the input structure without altering the query itself. We systematically evaluate these attacks across multiple Vision Language Models (VLMs) on eight distinct settings, showing that they can be composed to create stronger adversarial prompts, as also evidenced by their increased projection in the negative refusal direction in the residual stream, a property observed in prior successful attacks. Finally, for mitigation, we propose an adversarial training approach that makes the model robust against input prompt perturbations. By training the model on a range of harmful and benign prompts all perturbed with different RMA settings, the model loses its sensitivity to Role Confusion and Modality Manipulation attacks and is trained to only pay attention to the query content in the input prompt structure, effectively reducing Attack Success Rate (ASR) while preserving the model’s general utility.

Cite

Text

Shayegani et al. "Misaligned Roles, Misplaced Images: Structural Input Perturbations Expose Multimodal Alignment Blind Spots." International Conference on Learning Representations, 2026.

Markdown

[Shayegani et al. "Misaligned Roles, Misplaced Images: Structural Input Perturbations Expose Multimodal Alignment Blind Spots." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/shayegani2026iclr-misaligned/)

BibTeX

@inproceedings{shayegani2026iclr-misaligned,
  title     = {{Misaligned Roles, Misplaced Images: Structural Input Perturbations Expose Multimodal Alignment Blind Spots}},
  author    = {Shayegani, Erfan and Shahariar, G M and Abdali, Sara and Yu, Lei and Abu-Ghazaleh, Nael and Dong, Yue},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/shayegani2026iclr-misaligned/}
}