How Does Fine-Tuning Affect Your Model? Mechanistic Analysis on Procedural Tasks

Abstract

Fine-tuning large pre-trained models has become the *de facto* strategy for developing models that are safe to deploy. However, there has been little work explaining how fine-tuning alters the underlying capabilities learnt by a model during pretraining: does fine-tuning yield entirely novel capabilities, or does it merely modulate existing ones? We address this question empirically in *synthetic* settings, using mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities change. Our extensive analysis of the effects of fine-tuning shows: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a "wrapper", is typically learned on top of the underlying model capabilities; and (iii) further fine-tuning on a task where such wrapped capabilities are relevant leads to sample-efficient "revival" of the capability, i.e., the model begins reusing this capability within a few gradient steps. *This indicates that practitioners can unintentionally remove a model's safety wrapper by merely fine-tuning it on a superficially unrelated task.* We additionally perform analysis on language models trained on the TinyStories dataset to support our claims in a more realistic setup.
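
To make the probing methodology concrete, below is a minimal sketch (not the paper's actual code) of a linear probe that tests whether a pretrained capability remains linearly decodable from a fine-tuned model's intermediate activations. The `model`, `layer`, `inputs`, and `labels` names are hypothetical placeholders standing in for whatever network, layer, and capability-labeled data one wants to study.

```python
import torch
import torch.nn as nn

# Hypothetical setup: `model` is the (fine-tuned) network under study, `layer`
# is the module whose activations we probe, and (inputs, labels) encode the
# pretraining capability we want to detect. These names are placeholders,
# not the paper's actual experimental setup.
def linear_probe_accuracy(model, layer, inputs, labels, epochs=200, lr=1e-2):
    cache = {}

    # Capture the layer's output activations with a forward hook.
    def hook(_module, _inputs, output):
        cache["acts"] = output.detach()

    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        model(inputs)
    handle.remove()

    feats = cache["acts"].flatten(start_dim=1)  # (batch, features)
    num_classes = int(labels.max().item()) + 1
    probe = nn.Linear(feats.shape[1], num_classes)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    # Train only the probe; the model itself stays frozen throughout.
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(probe(feats), labels)
        loss.backward()
        optimizer.step()

    with torch.no_grad():
        preds = probe(feats).argmax(dim=-1)
    return (preds == labels).float().mean().item()
```

Under this setup, high probe accuracy on the fine-tuned model's activations would suggest the capability survives fine-tuning and is merely "wrapped" rather than erased, consistent with findings (i) and (ii) above.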

Cite

Text

Jain et al. "How Does Fine-Tuning Affect Your Model? Mechanistic Analysis on Procedural Tasks." NeurIPS 2023 Workshops: R0-FoMo, 2023.

Markdown

[Jain et al. "How Does Fine-Tuning Affect Your Model? Mechanistic Analysis on Procedural Tasks." NeurIPS 2023 Workshops: R0-FoMo, 2023.](https://mlanthology.org/neuripsw/2023/jain2023neuripsw-finetuning/)

BibTeX

@inproceedings{jain2023neuripsw-finetuning,
  title     = {{How Does Fine-Tuning Affect Your Model? Mechanistic Analysis on Procedural Tasks}},
  author    = {Jain, Samyak and Kirk, Robert and Lubana, Ekdeep Singh and Dick, Robert P. and Tanaka, Hidenori and Rocktäschel, Tim and Grefenstette, Edward and Krueger, David},
  booktitle = {NeurIPS 2023 Workshops: R0-FoMo},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/jain2023neuripsw-finetuning/}
}