Panacea: Mitigating Harmful Fine-Tuning for Large Language Models via Post-Fine-Tuning Perturbation

Abstract

Harmful fine-tuning attacks introduce significant security risks to fine-tuning services. Mainstream defenses aim to vaccinate the model so that a later harmful fine-tuning attack is less effective. However, our evaluation shows that such defenses are fragile: with a few fine-tuning steps, the model can still learn the harmful knowledge. We therefore conduct further experiments and find that an embarrassingly simple solution, adding purely random perturbations to the fine-tuned model, can recover the model from harmful behaviors, though it degrades the model's fine-tuning performance. To address this degradation, we propose Panacea, which optimizes an adaptive perturbation that is applied to the model after fine-tuning. Panacea maintains the model's safety alignment without compromising downstream fine-tuning performance. Comprehensive experiments are conducted across different harmful ratios, fine-tuning tasks, and mainstream LLMs, where the average harmful scores are reduced by up to 21.2% while fine-tuning performance is maintained. As a by-product, we analyze the adaptive perturbation and show that different layers in various LLMs have distinct safety coefficients. Source code is available at https://github.com/w-yibo/Panacea.
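
To make the random-perturbation baseline mentioned in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the perturbation is i.i.d. Gaussian noise added to every parameter after fine-tuning; the abstract does not specify the noise distribution or scale. The function name `add_random_perturbation`, the `sigma` parameter, and the toy weight dictionary are all hypothetical. Panacea's optimized adaptive perturbation is not reproduced here.

```python
import random


def add_random_perturbation(weights, sigma, seed=0):
    """Return a copy of `weights` with Gaussian noise N(0, sigma^2) added to each entry.

    `weights` maps a layer name to a flat list of floats. The Gaussian choice
    and the single global `sigma` are assumptions made for illustration.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return {
        name: [w + rng.gauss(0.0, sigma) for w in values]
        for name, values in weights.items()
    }


# Toy stand-in for a fine-tuned model: layer name -> flat list of parameters.
fine_tuned = {
    "layer0.weight": [0.10, -0.25, 0.40],
    "layer1.weight": [0.05, 0.30],
}

# The perturbation is applied once, after fine-tuning has finished.
perturbed = add_random_perturbation(fine_tuned, sigma=0.01)
```

In this sketch a larger `sigma` moves the weights further from the fine-tuned solution, which mirrors the trade-off the abstract reports: random noise can undo harmful behavior but also degrades fine-tuning performance.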

Cite

Text

Wang et al. "Panacea: Mitigating Harmful Fine-Tuning for Large Language Models via Post-Fine-Tuning Perturbation." Advances in Neural Information Processing Systems, 2025.

Markdown

[Wang et al. "Panacea: Mitigating Harmful Fine-Tuning for Large Language Models via Post-Fine-Tuning Perturbation." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/wang2025neurips-panacea/)

BibTeX

@inproceedings{wang2025neurips-panacea,
  title     = {{Panacea: Mitigating Harmful Fine-Tuning for Large Language Models via Post-Fine-Tuning Perturbation}},
  author    = {Wang, Yibo and Huang, Tiansheng and Shen, Li and Yao, Huanjin and Luo, Haotian and Liu, Rui and Tan, Naiqiang and Huang, Jiaxing and Tao, Dacheng},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/wang2025neurips-panacea/}
}