Toxic Neurons Aren’t Enough to Explain DPO: A Mechanistic Analysis for Toxicity Reduction
Abstract
Safety fine-tuning algorithms are widely used to reduce harmful outputs in language models. While studies show that these algorithms induce minimal changes to pre-trained model parameters, the mechanisms by which such small parameter changes lead to harm reduction remain unclear. For the direct preference optimization (DPO) algorithm applied to toxicity reduction, the current explanation claims that DPO reduces toxicity by dampening the activations of the most toxic MLP neurons. However, our activation patching experiments show that this explanation is incomplete. Projections onto a toxicity probe show that only 4.9% of toxicity reduction comes from dampened toxic neurons. Instead, DPO reduces toxicity through distributed activation shifts across four neuron groups: two removing toxicity and two promoting anti-toxicity, cumulatively shifting MLP outputs away from toxicity. Neurons that do not promote toxic tokens still contribute to this reduction through their weakly aligned components. These distributed activation shifts, induced by DPO's minimal parameter changes, form a mask over the pre-trained toxic capabilities while being small enough to preserve the model's general language capabilities. Building on these insights, we propose an activation patching technique on the identified neuron groups, which outperforms DPO in reducing toxicity while maintaining general language capabilities.
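The probe-projection accounting described in the abstract can be illustrated with a minimal sketch. It assumes an MLP output that is a sum of neuron activations times value vectors, so the change in projection onto a probe direction splits linearly into per-neuron terms. All names here (`projection_shift_per_neuron`, `share_from_top_toxic`, `value_vectors`, `probe`) and the random toy data are hypothetical, not the authors' code or data, and the numbers it produces are unrelated to the 4.9% reported above.

```python
import numpy as np

def projection_shift_per_neuron(act_pre, act_post, value_vectors, probe):
    """Per-neuron change in the MLP output's projection onto the probe direction.

    Assumes MLP output = sum_i act[i] * value_vectors[i], so the projection
    change decomposes linearly over neurons.
    """
    probe_unit = probe / np.linalg.norm(probe)
    alignment = value_vectors @ probe_unit  # shape: (n_neurons,)
    return (act_post - act_pre) * alignment

def share_from_top_toxic(shift, value_vectors, probe, k):
    """Fraction of the total projection shift carried by the k neurons whose
    value vectors have the highest cosine similarity with the probe."""
    probe_unit = probe / np.linalg.norm(probe)
    cos = (value_vectors @ probe_unit) / np.linalg.norm(value_vectors, axis=1)
    top = np.argsort(-cos)[:k]
    return shift[top].sum() / shift.sum()

# Toy data (assumption): random value vectors, probe, and activations.
rng = np.random.default_rng(0)
n_neurons, d_model = 64, 16
value_vectors = rng.normal(size=(n_neurons, d_model))
probe = rng.normal(size=d_model)
act_pre = rng.normal(size=n_neurons)

# Toy "fine-tuned" activations: every neuron moves slightly against its own
# alignment with the probe, mimicking a small, distributed shift.
act_post = act_pre - 0.1 * np.sign(value_vectors @ probe)

shift = projection_shift_per_neuron(act_pre, act_post, value_vectors, probe)
total_shift = shift.sum()
top_share = share_from_top_toxic(shift, value_vectors, probe, k=4)

print(f"total projection shift: {total_shift:.3f}")
print(f"share from top-4 probe-aligned neurons: {top_share:.1%}")
```

In this toy setup every neuron contributes a small negative term, so the few most probe-aligned neurons account for only part of the total shift; the remainder comes from neurons that are weakly aligned or anti-aligned with the probe.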
Cite
Text
Yang et al. "Toxic Neurons Aren’t Enough to Explain DPO: A Mechanistic Analysis for Toxicity Reduction." NeurIPS 2024 Workshops: SoLaR, 2024.
Markdown
[Yang et al. "Toxic Neurons Aren’t Enough to Explain DPO: A Mechanistic Analysis for Toxicity Reduction." NeurIPS 2024 Workshops: SoLaR, 2024.](https://mlanthology.org/neuripsw/2024/yang2024neuripsw-toxic/)
BibTeX
@inproceedings{yang2024neuripsw-toxic,
title = {{Toxic Neurons Aren’t Enough to Explain DPO: A Mechanistic Analysis for Toxicity Reduction}},
author = {Yang, Yushi and Sondej, Filip and Mayne, Harry and Mahdi, Adam},
booktitle = {NeurIPS 2024 Workshops: SoLaR},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/yang2024neuripsw-toxic/}
}