ConceptDrift: Uncovering Biases Through the Lens of Foundation Models

Abstract

Datasets and pre-trained models come with intrinsic biases. Most methods spot them by analyzing misclassified samples in a semi-automated human-computer validation loop. In contrast, we propose ConceptDrift, a method that analyzes the weights of a linear probe learned on top of a foundation model. We capitalize on the weight update trajectory, which starts from the text embedding of the class name and drifts towards embeddings that disclose hidden biases. Unlike prior work, this approach pinpoints unwanted correlations in a dataset, providing more than just possible explanations for wrong predictions. We empirically demonstrate the efficacy of our method by significantly improving zero-shot performance with bias-augmented prompting. Our method is not bound to a single modality; in this work we experiment with both image (Waterbirds, CelebA, Nico++) and text (CivilComments) datasets.
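
To make the weight-drift idea concrete, the following is a minimal, self-contained sketch of the concept described in the abstract, under illustrative assumptions: the frozen foundation-model features, class-name text embeddings, concept vocabulary, and training loop below are random placeholders and hypothetical choices, not the authors' exact setup. The probe is initialized from class-name text embeddings, trained on frozen features, and the drift of each class weight is then compared against a candidate concept vocabulary to surface possible biases.

# Minimal sketch of the weight-drift idea (illustrative assumptions throughout).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, n_samples, n_classes = 512, 1000, 2

# Stand-ins for frozen foundation-model image features and labels (assumption:
# random vectors replace real CLIP-style embeddings).
features = F.normalize(torch.randn(n_samples, dim), dim=-1)
labels = torch.randint(0, n_classes, (n_samples,))

# Stand-in for text embeddings of the class names (assumption).
class_text_emb = F.normalize(torch.randn(n_classes, dim), dim=-1)

# Linear probe initialized from the class-name text embeddings.
probe = torch.nn.Linear(dim, n_classes, bias=False)
with torch.no_grad():
    probe.weight.copy_(class_text_emb)
w_init = probe.weight.detach().clone()

# Train the probe on the frozen features with standard cross-entropy.
opt = torch.optim.SGD(probe.parameters(), lr=0.1)
for _ in range(100):
    opt.zero_grad()
    loss = F.cross_entropy(probe(features), labels)
    loss.backward()
    opt.step()

# Drift of each class weight away from its text-embedding starting point.
drift = F.normalize(probe.weight.detach() - w_init, dim=-1)

# Rank a candidate concept vocabulary (assumption: placeholder vectors for
# example words) by cosine similarity to the drift direction.
concept_vocab = ["forest", "ocean", "bamboo", "beach", "sky"]
concept_emb = F.normalize(torch.randn(len(concept_vocab), dim), dim=-1)
for c in range(n_classes):
    scores = concept_emb @ drift[c]
    top = scores.argsort(descending=True)[:3]
    print(f"class {c}: candidate bias concepts ->", [concept_vocab[i] for i in top])

In a real setting, the placeholder vectors would come from the foundation model's text and image encoders, and the concepts ranked highest by the drift direction would be the candidate spurious correlations used, for example, in bias-augmented prompting.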

Cite

Text

Paduraru et al. "ConceptDrift: Uncovering Biases Through the Lens of Foundation Models." NeurIPS 2024 Workshops: InterpretableAI, 2024.

Markdown

[Paduraru et al. "ConceptDrift: Uncovering Biases Through the Lens of Foundation Models." NeurIPS 2024 Workshops: InterpretableAI, 2024.](https://mlanthology.org/neuripsw/2024/paduraru2024neuripsw-conceptdrift/)

BibTeX

@inproceedings{paduraru2024neuripsw-conceptdrift,
  title     = {{ConceptDrift: Uncovering Biases Through the Lens of Foundation Models}},
  author    = {Paduraru, Cristian Daniel and Barbalau, Antonio and Filipescu, Radu and Nicolicioiu, Andrei Liviu and Burceanu, Elena},
  booktitle = {NeurIPS 2024 Workshops: InterpretableAI},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/paduraru2024neuripsw-conceptdrift/}
}