Mechanistically Analyzing the Effects of Fine-Tuning on Procedurally Defined Tasks
Abstract
Fine-tuning large pre-trained models has become the de facto strategy for developing models that are safe to deploy. However, there has been little work that explains how fine-tuning alters the underlying capabilities learnt by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in synthetic settings with mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model’s underlying capabilities are changing. Our extensive analysis of the effects of fine-tuning shows: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a ‘wrapper’, is typically learned on top of the underlying model capabilities; and (iii) further fine-tuning on a task where such wrapped capabilities are relevant leads to sample-efficient “revival” of the capability, i.e., the model begins reusing this capability in a few gradient steps. This indicates practitioners can unintentionally remove a model’s safety wrapper by merely fine-tuning it on a superficially unrelated task.
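The probing technique mentioned in the abstract can be illustrated with a minimal sketch: a linear probe is a simple classifier trained on a model's hidden activations to test whether a capability-relevant feature is linearly decodable. The data below is synthetic stand-in activations, not the paper's actual setup; all variable names are illustrative.

```python
# Minimal sketch of linear probing on hypothetical activations
# (synthetic data; not the paper's experimental setup).
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "hidden activations": 200 samples, 16 dims, with one
# label-correlated direction injected so the probe has a signal to find.
n, d = 200, 16
labels = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
acts = rng.normal(size=(n, d)) + np.outer(labels * 2.0 - 1.0, direction)

# Logistic-regression probe trained by plain gradient descent.
w = np.zeros(d)
b = 0.0
for _ in range(500):
    logits = acts @ w + b
    probs = 1.0 / (1.0 + np.exp(-logits))
    w -= 0.5 * (acts.T @ (probs - labels) / n)
    b -= 0.5 * np.mean(probs - labels)

# High probe accuracy suggests the feature is linearly represented
# in the activations; near-chance accuracy suggests it is not.
accuracy = np.mean((acts @ w + b > 0) == labels)
print(f"probe accuracy: {accuracy:.2f}")
```

In the paper's setting, such probes are applied before and after fine-tuning to check whether the pretrained capability's representation persists even when the fine-tuned model no longer expresses it in its outputs.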
Cite
Text
Jain et al. "Mechanistically Analyzing the Effects of Fine-Tuning on Procedurally Defined Tasks." ICLR 2024 Workshops: R2-FM, 2024.
Markdown
[Jain et al. "Mechanistically Analyzing the Effects of Fine-Tuning on Procedurally Defined Tasks." ICLR 2024 Workshops: R2-FM, 2024.](https://mlanthology.org/iclrw/2024/jain2024iclrw-mechanistically/)
BibTeX
@inproceedings{jain2024iclrw-mechanistically,
title = {{Mechanistically Analyzing the Effects of Fine-Tuning on Procedurally Defined Tasks}},
author = {Jain, Samyak and Kirk, Robert and Lubana, Ekdeep Singh and Dick, Robert P. and Tanaka, Hidenori and Rocktäschel, Tim and Grefenstette, Edward and Krueger, David},
booktitle = {ICLR 2024 Workshops: R2-FM},
year = {2024},
url = {https://mlanthology.org/iclrw/2024/jain2024iclrw-mechanistically/}
}