Pay Less Attention to Function Words for Free Robustness of Vision-Language Models

Abstract

To address the trade-off between robustness and performance in robust vision-language models (VLMs), we observe that function words can make VLMs vulnerable to cross-modal adversarial attacks, and we propose Function-word De-Attention (FDA) to mitigate this vulnerability. Inspired by differential transformers, FDA computes both the original cross-attention and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former for more robust alignment. Comprehensive experiments cover 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, FDA yields an average 18/13/53% drop in attack success rate (ASR) with only 0.2/0.3/0.6% performance drops on the 3 tested models on retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We experimentally demonstrate the scalability, generalization, and zero-shot performance of FDA, and provide in-depth ablation studies and analysis. Code is available at https://github.com/michaeltian108/FDA.
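The subtraction described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that). The names `FUNCTION_WORDS`, `function_word_mask`, `fda_cross_attention`, and the weight `lam` are hypothetical. The sketch also rests on three assumptions that the abstract does not confirm: that queries come from the image side and attend to text tokens, that the function-word attention is a softmax restricted to function-word tokens, and that the subtraction uses a fixed scalar weight.

```python
import numpy as np

# Hypothetical, deliberately tiny function-word list (an assumption, not from the paper).
FUNCTION_WORDS = {"a", "an", "the", "of", "on", "in", "at", "to", "and", "is"}


def function_word_mask(tokens):
    """Return a boolean array that is True where a token is a function word."""
    return np.array([t.lower() in FUNCTION_WORDS for t in tokens], dtype=bool)


def softmax(x, axis=-1):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def fda_cross_attention(q, k, v, fw_mask, lam=0.5):
    """Single-head cross-attention with function-word attention subtracted.

    q: (n_query, d) queries, assumed here to come from image patches.
    k: (n_text, d) keys and v: (n_text, d_v) values from text tokens.
    fw_mask: (n_text,) boolean mask marking function-word tokens.
    lam: assumed scalar weight on the subtracted attention map.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)

    # Original cross-attention over all text tokens.
    attn = softmax(scores)

    # Function-word cross-attention: softmax over function-word tokens only.
    if fw_mask.any():
        fw_scores = np.where(fw_mask[None, :], scores, -np.inf)
        fw_attn = softmax(fw_scores)
    else:
        fw_attn = np.zeros_like(attn)

    # Differential subtraction: down-weight attention paid to function words.
    diff = attn - lam * fw_attn
    return diff @ v, diff


# Usage with random features standing in for real encoder outputs.
rng = np.random.default_rng(0)
tokens = ["the", "dog", "on", "a", "couch"]
mask = function_word_mask(tokens)
q = rng.normal(size=(3, 4))
k = rng.normal(size=(len(tokens), 4))
v = rng.normal(size=(len(tokens), 2))
out, diff = fda_cross_attention(q, k, v, mask, lam=0.5)
```

In this sketch the attention weights on content words ("dog", "couch") are unchanged, while the weights on function words shrink and may become negative. Whether the published method renormalizes the result or learns the weight is not stated in the abstract.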

Cite

Text

Tian et al. "Pay Less Attention to Function Words for Free Robustness of Vision-Language Models." International Conference on Learning Representations, 2026.

Markdown

[Tian et al. "Pay Less Attention to Function Words for Free Robustness of Vision-Language Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/tian2026iclr-pay/)

BibTeX

@inproceedings{tian2026iclr-pay,
  title     = {{Pay Less Attention to Function Words for Free Robustness of Vision-Language Models}},
  author    = {Tian, Qiwei and Lin, Chenhao and Zhao, Zhengyu and Shen, Chao},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/tian2026iclr-pay/}
}