Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models

Abstract

Multi-modal foundation models like OpenFlamingo, LLaVA, and GPT-4 are increasingly used for various real-world tasks. Prior work has shown that these models are highly vulnerable to adversarial attacks on the vision modality. Such attacks can be leveraged to spread fake information or defraud users, and thus pose a significant risk, making the robustness of large multi-modal foundation models a pressing problem. The CLIP model, or one of its variants, is used as a frozen vision encoder in many vision-language models (VLMs), e.g., LLaVA and OpenFlamingo. We propose an unsupervised adversarial fine-tuning scheme to obtain a robust CLIP vision encoder, which confers robustness on all vision downstream tasks (VLMs, zero-shot classification) that rely on CLIP. No retraining or fine-tuning of the VLM is required.
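As a rough illustration of the general idea described in the abstract, the sketch below shows one way an unsupervised, embedding-space adversarial fine-tuning loop for a CLIP vision encoder could look in PyTorch. The embedding-matching loss (squared distance to the frozen original CLIP embedding), the PGD hyperparameters, and all function names are assumptions for illustration only, not the paper's exact configuration.

```python
# Minimal sketch: unsupervised adversarial fine-tuning of a CLIP vision encoder.
# Assumed objective: keep embeddings of adversarially perturbed images close to
# the embeddings produced by the original, frozen CLIP encoder (no labels used).
import torch
import torch.nn.functional as F


def pgd_attack(encoder, images, clean_emb, eps=4 / 255, alpha=1 / 255, steps=10):
    """Find L-infinity-bounded perturbations that push the encoder's embeddings
    away from the clean-image embeddings (illustrative PGD settings)."""
    delta = torch.zeros_like(images).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        adv_emb = encoder((images + delta).clamp(0, 1))
        loss = F.mse_loss(adv_emb, clean_emb)  # attacker maximizes embedding drift
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (images + delta).clamp(0, 1).detach()


def finetune_robust_clip(encoder, frozen_encoder, loader, epochs=1, lr=1e-5):
    """Fine-tune `encoder` so that adversarial embeddings match the targets of
    the original frozen CLIP encoder; downstream VLMs stay untouched."""
    opt = torch.optim.AdamW(encoder.parameters(), lr=lr)
    for _ in range(epochs):
        for images in loader:  # unlabeled image batches
            with torch.no_grad():
                clean_emb = frozen_encoder(images)  # targets from original CLIP
            adv_images = pgd_attack(encoder, images, clean_emb)
            loss = F.mse_loss(encoder(adv_images), clean_emb)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return encoder
```

Because only the vision encoder is updated against a fixed embedding target, the fine-tuned encoder can be dropped into any CLIP-based pipeline (e.g., LLaVA or OpenFlamingo) without retraining the language model, which is the property the abstract highlights.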

Cite

Text

Schlarmann et al. "Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models." ICLR 2024 Workshops: ME-FoMo, 2024.

Markdown

[Schlarmann et al. "Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models." ICLR 2024 Workshops: ME-FoMo, 2024.](https://mlanthology.org/iclrw/2024/schlarmann2024iclrw-robust/)

BibTeX

@inproceedings{schlarmann2024iclrw-robust,
  title     = {{Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models}},
  author    = {Schlarmann, Christian and Singh, Naman Deep and Croce, Francesco and Hein, Matthias},
  booktitle = {ICLR 2024 Workshops: ME-FoMo},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/schlarmann2024iclrw-robust/}
}