LEACE: Perfect Linear Concept Erasure in Closed Form

Abstract

Concept erasure aims to remove specified features from a representation. It can improve fairness (e.g. preventing a classifier from using gender or race) and interpretability (e.g. removing a concept to observe changes in model behavior). We introduce LEAst-squares Concept Erasure (LEACE), a closed-form method which provably prevents all linear classifiers from detecting a concept while changing the representation as little as possible, as measured by a broad class of norms. We apply LEACE to large language models with a novel procedure called concept scrubbing, which erases target concept information from every layer in the network. We demonstrate our method on two tasks: measuring the reliance of language models on part-of-speech information, and reducing gender bias in BERT embeddings. Our code is available at https://github.com/EleutherAI/concept-erasure.
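To illustrate what "closed form" means here, the following NumPy sketch implements the erasure map as we understand it from the paper: r(x) = x - W^+ P W (x - E[x]), where W is a whitening matrix (the pseudo-inverse square root of the covariance of X) and P is the orthogonal projector onto the column space of W times the cross-covariance of X with the concept labels Z. The function name `fit_leace`, the `tol` parameter, and the synthetic data are illustrative assumptions, not the API of the concept-erasure library linked above.

```python
import numpy as np


def fit_leace(X, Z, tol=1e-8):
    """Fit a LEACE-style eraser on data X (n, d) and concept labels Z (n, k).

    Hypothetical helper for illustration; returns a function that maps
    row vectors x to erased row vectors of the same shape.
    """
    n = X.shape[0]
    mu = X.mean(axis=0)
    Xc = X - mu
    Zc = Z - Z.mean(axis=0)

    # Empirical covariance of X and cross-covariance between X and Z.
    sigma_xx = Xc.T @ Xc / n
    sigma_xz = Xc.T @ Zc / n

    # Whitening matrix W = sigma_xx^{-1/2} (pseudo-inverse) and its
    # pseudo-inverse W^+ = sigma_xx^{1/2}, both via eigendecomposition.
    evals, evecs = np.linalg.eigh(sigma_xx)
    keep = evals > tol * evals.max()
    safe = np.where(keep, evals, 1.0)  # avoid dividing by ~0 eigenvalues
    W = (evecs * np.where(keep, safe ** -0.5, 0.0)) @ evecs.T
    W_pinv = (evecs * np.where(keep, safe ** 0.5, 0.0)) @ evecs.T

    # Orthogonal projector onto the column space of W @ sigma_xz.
    U, s, _ = np.linalg.svd(W @ sigma_xz, full_matrices=False)
    U = U[:, s > tol * s.max()]
    P = U @ U.T

    A = W_pinv @ P @ W

    def erase(x):
        # Apply r(x) = x - A (x - mu) to each row of x.
        return x - (x - mu) @ A.T

    return erase


# Synthetic example (assumed data): a binary concept shifts every feature.
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=500)
X = rng.normal(size=(500, 8)) + 2.0 * z[:, None]
Z = z[:, None].astype(float)

erase = fit_leace(X, Z)
X_erased = erase(X)

# After erasure, the empirical cross-covariance with Z should vanish
# up to floating-point error.
cross_cov = (X_erased - X_erased.mean(axis=0)).T @ (Z - Z.mean(axis=0)) / len(X)
print(np.abs(cross_cov).max())
```

On the data used to fit the eraser, the cross-covariance between the erased representation and the labels is zero up to rounding. As we understand the paper's argument, this zero cross-covariance is the condition under which no linear classifier can do better than a constant predictor. The sketch covers only this single-representation step, not the layer-by-layer concept scrubbing procedure.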

Cite

Text

Belrose et al. "LEACE: Perfect Linear Concept Erasure in Closed Form." Neural Information Processing Systems, 2023.

Markdown

[Belrose et al. "LEACE: Perfect Linear Concept Erasure in Closed Form." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/belrose2023neurips-leace/)

BibTeX

@inproceedings{belrose2023neurips-leace,
  title     = {{LEACE: Perfect Linear Concept Erasure in Closed Form}},
  author    = {Belrose, Nora and Schneider-Joseph, David and Ravfogel, Shauli and Cotterell, Ryan and Raff, Edward and Biderman, Stella},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/belrose2023neurips-leace/}
}