Steering Fine-Tuning Generalization with Targeted Concept Ablation
Abstract
Models often learn unintended behaviors during fine-tuning, such as adopting spurious correlations present in the training data. We present a novel technique for controlling what models learn during fine-tuning by identifying and ablating specific sparse autoencoder latents that represent undesired concepts. Our approach steers models toward intended generalizations even in cases where multiple policies correctly fit the training data. We evaluate our method on two tasks, significantly outperforming baselines on both: a gender bias task containing spurious correlations and a double multiple choice task where models must learn to focus on the intended question while ignoring the other. On gender bias, our method completely eliminates the spurious correlation, leading to strong out-of-distribution performance. On double multiple choice, it succeeds in 10 out of 16 scenarios. Our results mark an initial step toward using interpretability techniques to ensure the safe and reliable deployment of frontier AI systems.
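
To make the kind of intervention described above concrete, the sketch below shows one way to remove a single SAE latent's contribution to the residual stream with a PyTorch forward hook, so that fine-tuning cannot rely on the ablated concept. This is an illustrative sketch, not the authors' implementation: the SAE interface (sae.encode, sae.W_dec), the hooked layer, and the latent index are assumptions for the example.

# Minimal sketch (assumed SAE interface, not the paper's code):
# ablate one SAE latent from the residual stream during fine-tuning.
import torch

def make_ablation_hook(sae, latent_idx):
    """Return a forward hook that removes one SAE latent's contribution."""
    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output  # [batch, seq, d_model]
        acts = sae.encode(resid)              # assumed: [batch, seq, n_latents]
        direction = sae.W_dec[latent_idx]     # assumed: [d_model] decoder direction
        # Subtract the targeted latent's reconstruction so the concept is
        # unavailable to later layers (and to the fine-tuning gradients).
        resid = resid - acts[..., latent_idx, None] * direction
        if isinstance(output, tuple):
            return (resid,) + output[1:]
        return resid
    return hook

# Hypothetical usage: register the hook on a chosen transformer block,
# then run an ordinary fine-tuning loop and remove the hook afterwards.
# handle = model.model.layers[12].register_forward_hook(
#     make_ablation_hook(sae, latent_idx=4217))
# ... fine-tune ...
# handle.remove()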
Cite
BibTeX
@inproceedings{casademunt2025iclrw-steering-a,
  title     = {{Steering Fine-Tuning Generalization with Targeted Concept Ablation}},
  author    = {Casademunt, Helena and Juang, Caden and Rajamanoharan, Senthooran and Nanda, Neel},
  booktitle = {ICLR 2025 Workshops: SLLM},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/casademunt2025iclrw-steering-a/}
}