How Does Fine-Tuning Affect Your Model? Mechanistic Analysis on Procedural Tasks
Abstract
Fine-tuning large pre-trained models has become the *de facto* strategy for developing models that are safe to deploy. However, there has been little work explaining how fine-tuning alters the underlying capabilities learned by a model during pretraining: does fine-tuning yield entirely novel capabilities, or does it merely modulate existing ones? We address this question empirically in *synthetic* settings with mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities change. Our extensive analysis of the effects of fine-tuning shows: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a "wrapper", is typically learned on top of the underlying model capabilities; and (iii) further fine-tuning on a task where such wrapped capabilities are relevant leads to sample-efficient "revival" of the capability, i.e., the model begins reusing this capability within a few gradient steps. *This indicates practitioners can unintentionally remove a model's safety wrapper by merely fine-tuning it on a superficially unrelated task.* We additionally perform analysis on language models trained on the TinyStories dataset to support our claims in a more realistic setup.
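As a rough illustration of the probing methodology mentioned above, the sketch below fits a linear probe on frozen activations and compares decodability before and after fine-tuning. Everything here is an assumption for demonstration, not the paper's actual setup: the activations are synthetic, and the "wrapper" is modeled as a small additive perturbation that leaves the capability direction linearly decodable.

```python
# Minimal linear-probe sketch (assumed setup, not the paper's code).
# Idea: if a probe still decodes the pretraining capability from the
# fine-tuned model's activations even when behavior has changed, the
# capability is likely intact behind a learned "wrapper".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def probe_accuracy(activations: np.ndarray, labels: np.ndarray) -> float:
    """Fit a linear probe on frozen activations; return held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        activations, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_tr, y_tr)
    return probe.score(X_te, y_te)


rng = np.random.default_rng(0)
n_examples, d_model = 2000, 64

# Synthetic stand-in for per-example hidden activations: the capability
# corresponds to a fixed direction in activation space, plus noise.
labels = rng.integers(0, 2, size=n_examples)
capability_dir = rng.normal(size=(1, d_model))
acts_pretrained = labels[:, None] * capability_dir + 0.5 * rng.normal(
    size=(n_examples, d_model)
)

# Model fine-tuning as a "wrapper": a small perturbation overlaid on the
# representations, leaving the capability direction linearly decodable.
acts_finetuned = acts_pretrained + 0.1 * rng.normal(size=(n_examples, d_model))

print(f"pretrained probe acc: {probe_accuracy(acts_pretrained, labels):.3f}")
print(f"fine-tuned probe acc: {probe_accuracy(acts_finetuned, labels):.3f}")
```

In this toy setting, both probes reach similar accuracy, mirroring the paper's claim that fine-tuning wraps rather than erases the underlying capability; in a real analysis, the activations would come from corresponding layers of the pretrained and fine-tuned networks.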
Cite
Text
Jain et al. "How Does Fine-Tuning Affect Your Model? Mechanistic Analysis on Procedural Tasks." NeurIPS 2023 Workshops: R0-FoMo, 2023.
Markdown
[Jain et al. "How Does Fine-Tuning Affect Your Model? Mechanistic Analysis on Procedural Tasks." NeurIPS 2023 Workshops: R0-FoMo, 2023.](https://mlanthology.org/neuripsw/2023/jain2023neuripsw-finetuning/)
BibTeX
@inproceedings{jain2023neuripsw-finetuning,
  title = {{How Does Fine-Tuning Affect Your Model? Mechanistic Analysis on Procedural Tasks}},
  author = {Jain, Samyak and Kirk, Robert and Lubana, Ekdeep Singh and Dick, Robert P. and Tanaka, Hidenori and Rocktäschel, Tim and Grefenstette, Edward and Krueger, David},
  booktitle = {NeurIPS 2023 Workshops: R0-FoMo},
  year = {2023},
  url = {https://mlanthology.org/neuripsw/2023/jain2023neuripsw-finetuning/}
}