On the Zero-Shot Adversarial Robustness of Vision-Language Models: A Truly Zero-Shot and Training-Free Approach
Abstract
Pre-trained Vision-Language Models (VLMs) such as CLIP have demonstrated strong zero-shot generalization capabilities. Despite their effectiveness on various downstream tasks, they remain vulnerable to adversarial samples. Existing methods fine-tune VLMs to improve their robustness by performing adversarial training on a specific dataset. However, this can lead to model overfitting and is not a true zero-shot scenario. In this paper, we propose a truly zero-shot and training-free approach that can significantly improve the VLM's zero-shot adversarial robustness. Specifically, we first discover that simply adding Gaussian noise greatly enhances the VLM's zero-shot performance. Then, we treat the adversarial examples with added Gaussian noise as anchors and strive to find a path in the embedding space that leads from the adversarial examples to the cleaner samples. Unlike previous methods, our approach improves the VLMs' generalization abilities in a truly zero-shot and training-free manner. Extensive experiments on 16 datasets demonstrate that our method can achieve state-of-the-art zero-shot robust performance, improving the top-1 robust accuracy by an average of 9.77%. The code will be publicly available.
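As a rough illustration of the first observation in the abstract (adding Gaussian noise to the input before zero-shot classification), the following is a minimal sketch, not the authors' implementation. The names `add_gaussian_noise`, `toy_encoder`, and `zero_shot_predict`, the noise level, and the linear stand-in encoder are all assumptions made for illustration. The abstract does not specify how the path in the embedding space is found, so that step is not sketched here.

```python
import math
import random


def add_gaussian_noise(pixels, sigma, rng):
    """Add N(0, sigma^2) noise to each pixel and clip to [0, 1]."""
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(image_embedding, text_embeddings):
    """CLIP-style zero-shot prediction: index of the most similar text embedding."""
    scores = [cosine(image_embedding, t) for t in text_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


def toy_encoder(pixels, projection):
    """Hypothetical stand-in for a VLM image encoder: a fixed linear map."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in projection]


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed toy setup: a 12-pixel "image", 4-dim embeddings, 3 class prompts.
    image = [rng.random() for _ in range(12)]
    projection = [[rng.gauss(0.0, 1.0) for _ in range(12)] for _ in range(4)]
    text_embeddings = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]

    # Perturb the (possibly adversarial) input with Gaussian noise, then classify.
    noisy = add_gaussian_noise(image, sigma=0.1, rng=rng)
    label = zero_shot_predict(toy_encoder(noisy, projection), text_embeddings)
    print("predicted class index:", label)
```

In practice the stand-in encoder would be replaced by a real VLM's image and text encoders, and the noise level would need to be tuned; both choices here are placeholders.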
Cite
Text
Tong et al. "On the Zero-Shot Adversarial Robustness of Vision-Language Models: A Truly Zero-Shot and Training-Free Approach." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01855
Markdown
[Tong et al. "On the Zero-Shot Adversarial Robustness of Vision-Language Models: A Truly Zero-Shot and Training-Free Approach." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/tong2025cvpr-zeroshot/) doi:10.1109/CVPR52734.2025.01855
BibTeX
@inproceedings{tong2025cvpr-zeroshot,
title = {{On the Zero-Shot Adversarial Robustness of Vision-Language Models: A Truly Zero-Shot and Training-Free Approach}},
author = {Tong, Baoshun and Lai, Hanjiang and Pan, Yan and Yin, Jian},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {19921-19930},
doi = {10.1109/CVPR52734.2025.01855},
url = {https://mlanthology.org/cvpr/2025/tong2025cvpr-zeroshot/}
}