Circuit Breaking: Removing Model Behaviors with Targeted Ablation

Abstract

Language models often exhibit behaviors that improve performance on a pre-training objective but harm performance on downstream tasks. We propose a novel approach to removing undesirable behaviors by ablating a small number of causal pathways between model components, with the intention of disabling the computational circuit responsible for the bad behavior. Given a small dataset of inputs where the model behaves poorly, we learn to ablate a small number of important causal pathways. In the setting of reducing GPT-2 toxic language generation, we find that ablating just 12 of the 11.6K causal edges mitigates toxic generation with minimal degradation of performance on other inputs.
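As a rough illustration of the edge-ablation idea, the PyTorch sketch below cuts a single causal pathway between two components while leaving every other path intact. The toy two-component residual architecture, the zero-ablation choice, and names such as ToyResidualModel and edge_a_to_b are assumptions for illustration, not the paper's actual GPT-2 setup.

import torch
import torch.nn as nn

class ToyResidualModel(nn.Module):
    """Two 'components' writing into a shared residual stream (illustrative only)."""
    def __init__(self, d=16):
        super().__init__()
        self.comp_a = nn.Linear(d, d)
        self.comp_b = nn.Linear(d, d)
        self.head = nn.Linear(d, 2)
        # Edge mask: 1.0 keeps the causal path comp_a -> comp_b, 0.0 ablates it.
        self.edge_a_to_b = 1.0

    def forward(self, x):
        out_a = self.comp_a(x)
        resid = x + out_a
        # comp_b normally reads comp_a's output via the residual stream.
        # Ablating the edge removes comp_a's contribution from comp_b's *input*
        # only; comp_a's direct path to the output head is left untouched.
        b_input = x + self.edge_a_to_b * out_a
        resid = resid + self.comp_b(b_input)
        return self.head(resid)

model = ToyResidualModel()
x = torch.randn(4, 16)
baseline = model(x)
model.edge_a_to_b = 0.0   # ablate the single causal edge comp_a -> comp_b
ablated = model(x)
print((baseline - ablated).abs().max())  # behavior change attributable to one edge

In the paper's setting the analogous operation is applied to edges of GPT-2's computational graph, with the small set of edges to ablate selected using the dataset of inputs on which the model behaves poorly.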

Cite

Text

Li et al. "Circuit Breaking: Removing Model Behaviors with Targeted Ablation." ICML 2023 Workshops: DeployableGenerativeAI, 2023.

Markdown

[Li et al. "Circuit Breaking: Removing Model Behaviors with Targeted Ablation." ICML 2023 Workshops: DeployableGenerativeAI, 2023.](https://mlanthology.org/icmlw/2023/li2023icmlw-circuit/)

BibTeX

@inproceedings{li2023icmlw-circuit,
  title     = {{Circuit Breaking: Removing Model Behaviors with Targeted Ablation}},
  author    = {Li, Maximilian and Davies, Xander and Nadeau, Max},
  booktitle = {ICML 2023 Workshops: DeployableGenerativeAI},
  year      = {2023},
  url       = {https://mlanthology.org/icmlw/2023/li2023icmlw-circuit/}
}