Do as I Do (Safely): Mitigating Task-Specific Fine-Tuning Risks in Large Language Models
Abstract
Recent research shows that fine-tuning on benign instruction-following data can inadvertently undo the safety alignment process and increase a model's propensity to comply with harmful queries. While instruction-following fine-tuning is important, task-specific fine-tuning, where models are trained on datasets with clear ground-truth answers (e.g., multiple-choice questions), can enhance model performance on specialized downstream tasks. Understanding and mitigating safety risks in the task-specific setting remains distinct from the instruction-following context due to structural differences in the data. Our work demonstrates how malicious actors can subtly manipulate the structure of almost *any* task-specific dataset to foster significantly more dangerous model behaviors, while maintaining an appearance of innocuity and reasonable downstream task performance. To address this issue, we propose a novel mitigation strategy that mixes in safety data which *mimics* the task format and prompting style of the user data, showing that it is significantly more effective and efficient than existing baselines at re-establishing safety alignment while maintaining similar task performance.
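To make the mitigation concrete, below is a minimal sketch (not code from the paper) of the general idea: safety (refusal) examples are recast into the same task format and prompting style as the user's multiple-choice fine-tuning data and mixed in at a small fraction. The prompt template, helper names, and 10% mixing ratio are illustrative assumptions, not the paper's exact setup.

```python
# Hypothetical sketch of "format-mimicking" safety data mixing for a
# multiple-choice fine-tuning task. Everything below (template, names,
# mixing ratio) is an assumption for illustration only.
import random

# Prompt template assumed for the downstream multiple-choice task.
MCQ_TEMPLATE = "Question: {question}\nOptions:\n{options}\nAnswer:"

def format_mcq(question: str, options: list[str], answer_idx: int) -> dict:
    """Render one example in the task's prompting style."""
    letters = "ABCD"
    options_str = "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(options))
    return {
        "prompt": MCQ_TEMPLATE.format(question=question, options=options_str),
        "completion": f" {letters[answer_idx]}",
    }

def mimic_safety_example(harmful_prompt: str, refusal: str) -> dict:
    """Cast a safety (refusal) example into the same multiple-choice structure,
    so that safety data looks like in-distribution task data.
    (A fuller implementation would also randomize the option order.)"""
    options = [refusal, "Provide the requested harmful content."]
    return format_mcq(harmful_prompt, options, answer_idx=0)

def build_mixed_dataset(task_examples: list[dict],
                        safety_pairs: list[tuple[str, str]],
                        safety_fraction: float = 0.1,
                        seed: int = 0) -> list[dict]:
    """Mix the user's task data with format-mimicking safety examples."""
    rng = random.Random(seed)
    n_safety = max(1, int(safety_fraction * len(task_examples)))
    n_safety = min(n_safety, len(safety_pairs))
    safety = [mimic_safety_example(p, r) for p, r in rng.sample(safety_pairs, k=n_safety)]
    mixed = task_examples + safety
    rng.shuffle(mixed)
    return mixed
```

The design choice this sketch highlights is that the safety examples are not left in their original chat-style form; they are rewritten to match the structure of the task data, which is what the abstract credits for the improved trade-off between safety and task performance.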
Cite

Text

Eiras et al. "Do as I Do (Safely): Mitigating Task-Specific Fine-Tuning Risks in Large Language Models." International Conference on Learning Representations, 2025.

Markdown

[Eiras et al. "Do as I Do (Safely): Mitigating Task-Specific Fine-Tuning Risks in Large Language Models." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/eiras2025iclr-safely/)

BibTeX
@inproceedings{eiras2025iclr-safely,
title = {{Do as I Do (Safely): Mitigating Task-Specific Fine-Tuning Risks in Large Language Models}},
author = {Eiras, Francisco and Petrov, Aleksandar and Torr, Philip and Kumar, M. Pawan and Bibi, Adel},
booktitle = {International Conference on Learning Representations},
year = {2025},
url = {https://mlanthology.org/iclr/2025/eiras2025iclr-safely/}
}