Robustifying Zero-Shot Vision Language Models by Subspaces Alignment

Abstract

Vision-Language Models (VLMs) enjoy strong zero-shot performance but are vulnerable to adversarial attacks, posing security risks. Adversarially robust fine-tuning enhances zero-shot robustness on new datasets while preserving the natural performance of pre-trained VLMs. However, prior methods use sample-wise adversarial fine-tuning, neglecting the underlying second-order statistics that characterize entire groups of samples. This leads to a feature-level discrepancy between clean and adversarial samples and their augmented variants. Thus, we propose to represent groups of samples as subspaces that capture their distributions, turning traditional sample-wise adversarial fine-tuning into its distributional counterpart. For each image, we build distributions from (i) a clean sample with its augmentations and (ii) their adversarial counterparts. For text, we build distributions from (iii) a clean prompt and its synonymous prompts and (iv) their adversarial counterparts. We then align image subspaces with text subspaces, and "adversarial" subspaces are additionally aligned toward "clean" subspaces. Consequently, all samples underlying these distributions (potentially infinitely many) are also aligned, leading to generalizable robustness. Evaluations on 15 datasets are provided.
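The core idea sketched in the abstract, representing a group of embeddings (a sample plus its augmented or adversarial variants) as a low-dimensional subspace and aligning subspaces rather than individual samples, can be illustrated as follows. This is a minimal sketch, not the authors' implementation: the SVD-based basis construction and the projection-metric (chordal) distance are standard choices assumed here for illustration.

```python
import numpy as np

def subspace_basis(features: np.ndarray, k: int) -> np.ndarray:
    """Return an orthonormal basis (d, k) for the subspace spanned by a
    group of embeddings (n, d), e.g. a clean sample with its augmentations.
    The top-k right singular vectors capture the group's dominant
    second-order structure."""
    _, _, vt = np.linalg.svd(features, full_matrices=False)
    return vt[:k].T  # columns are orthonormal

def subspace_distance(A: np.ndarray, B: np.ndarray) -> float:
    """Chordal (projection-metric) distance between span(A) and span(B)
    for orthonormal bases A, B of rank k:
        0.5 * ||A A^T - B B^T||_F^2 = k - ||A^T B||_F^2.
    Zero iff the subspaces coincide; used here as an alignment loss."""
    k = A.shape[1]
    return float(k - np.linalg.norm(A.T @ B, ord="fro") ** 2)

# Toy usage: align an "adversarial" group's subspace toward a "clean" one.
rng = np.random.default_rng(0)
clean_feats = rng.standard_normal((8, 32))          # clean sample + augmentations
adv_feats = clean_feats + 0.1 * rng.standard_normal((8, 32))  # adversarial variants
A_clean = subspace_basis(clean_feats, k=3)
A_adv = subspace_basis(adv_feats, k=3)
alignment_loss = subspace_distance(A_clean, A_adv)  # small for similar groups
```

In the distributional view, minimizing such a loss between the clean and adversarial subspaces (and between image and text subspaces) aligns every point of the underlying distributions at once, rather than matching samples pairwise.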

Cite

Text

Dong et al. "Robustifying Zero-Shot Vision Language Models by Subspaces Alignment." International Conference on Computer Vision, 2025.

Markdown

[Dong et al. "Robustifying Zero-Shot Vision Language Models by Subspaces Alignment." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/dong2025iccv-robustifying/)

BibTeX

@inproceedings{dong2025iccv-robustifying,
  title     = {{Robustifying Zero-Shot Vision Language Models by Subspaces Alignment}},
  author    = {Dong, Junhao and Koniusz, Piotr and Feng, Liaoyuan and Zhang, Yifei and Zhu, Hao and Liu, Weiming and Qu, Xinghua and Ong, Yew-Soon},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {21037--21047},
  url       = {https://mlanthology.org/iccv/2025/dong2025iccv-robustifying/}
}