RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models
Abstract
Fine-tuned vision-language models (VLMs) often capture spurious correlations between image features and textual attributes, resulting in degraded zero-shot performance at test time. Existing approaches for addressing spurious correlations (i) primarily operate at the global image level rather than intervening directly on fine-grained image features and (ii) are predominantly designed for unimodal settings. In this work, we present RaVL, which takes a fine-grained perspective on VLM robustness by discovering and mitigating spurious correlations using local image features rather than operating at the global image level. Given a fine-tuned VLM, RaVL first discovers spurious correlations by leveraging a region-level clustering approach to identify precise image features contributing to zero-shot classification errors. Then, RaVL mitigates the identified spurious correlation with a novel region-aware loss function that enables the VLM to focus on relevant regions and ignore spurious relationships during fine-tuning. We evaluate RaVL on 654 VLMs with various model architectures, data domains, and learned spurious correlations. Our results show that RaVL accurately discovers (191% improvement over the closest baseline) and mitigates (8.2% improvement on worst-group image classification accuracy) spurious correlations. Qualitative evaluations on general-domain and medical-domain VLMs confirm our findings.
Cite
Text
Varma et al. "RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models." Neural Information Processing Systems, 2024. doi:10.52202/079017-2614
Markdown
[Varma et al. "RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/varma2024neurips-ravl/) doi:10.52202/079017-2614
BibTeX
@inproceedings{varma2024neurips-ravl,
title = {{RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models}},
author = {Varma, Maya and Delbrouck, Jean-Benoit and Chen, Zhihong and Chaudhari, Akshay and Langlotz, Curtis},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-2614},
url = {https://mlanthology.org/neurips/2024/varma2024neurips-ravl/}
}