The Effect of Fine-Tuning on Language Model Toxicity
Abstract
Fine-tuning language models has become increasingly popular following the proliferation of open models and improvements in cost-effective, parameter-efficient fine-tuning. However, fine-tuning can influence model properties such as safety. We assess how fine-tuning affects different open models’ propensity to output toxic content through three experiments on Gemma, Llama, and Phi models. First, we compare how much model developers reduce toxicity during instruction-tuning. We then show that a small amount of parameter-efficient fine-tuning of these developer-tuned models, via low-rank adaptation on a non-adversarial dataset, can significantly alter toxicity rates across models. Finally, we highlight the impact of this in the wild, demonstrating how toxicity rates of models fine-tuned by community contributors can deviate in hard-to-predict ways.
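The following is a minimal sketch of the kind of parameter-efficient fine-tuning via low-rank adaptation (LoRA) the abstract refers to, using the Hugging Face transformers and peft libraries. The specific model, dataset, adapter targets, and hyperparameters are illustrative assumptions, not the paper's actual experimental setup.

```python
# Sketch: LoRA fine-tuning of an instruction-tuned open model on a
# non-adversarial dataset. All choices below are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

model_name = "google/gemma-2b-it"  # assumed developer-tuned model; swap for a Llama or Phi variant
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Low-rank adapters on the attention projections; only these small
# adapter matrices are trained, the base weights stay frozen.
lora_config = LoraConfig(r=8, lora_alpha=16,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# A small, non-adversarial instruction dataset (illustrative choice).
dataset = load_dataset("databricks/databricks-dolly-15k", split="train[:1000]")

def tokenize(example):
    text = example["instruction"] + "\n" + example["response"]
    return tokenizer(text, truncation=True, max_length=512)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out",
                           per_device_train_batch_size=2,
                           num_train_epochs=1,
                           learning_rate=2e-4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

After training, the adapted model's toxicity could be compared against the original developer-tuned checkpoint with a toxicity classifier over a fixed prompt set; the paper's evaluation protocol is not reproduced here.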
Cite

Text
Hawkins et al. "The Effect of Fine-Tuning on Language Model Toxicity." NeurIPS 2024 Workshops: SafeGenAi, 2024.

Markdown
[Hawkins et al. "The Effect of Fine-Tuning on Language Model Toxicity." NeurIPS 2024 Workshops: SafeGenAi, 2024.](https://mlanthology.org/neuripsw/2024/hawkins2024neuripsw-effect/)

BibTeX
@inproceedings{hawkins2024neuripsw-effect,
  title     = {{The Effect of Fine-Tuning on Language Model Toxicity}},
  author    = {Hawkins, Will and Mittelstadt, Brent and Russell, Chris},
  booktitle = {NeurIPS 2024 Workshops: SafeGenAi},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/hawkins2024neuripsw-effect/}
}