Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

Abstract

Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries.

Cite

Text

Halawi et al. "Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation." International Conference on Machine Learning, 2024.

Markdown

[Halawi et al. "Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/halawi2024icml-covert/)

BibTeX

@inproceedings{halawi2024icml-covert,
  title     = {{Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation}},
  author    = {Halawi, Danny and Wei, Alexander and Wallace, Eric and Wang, Tony Tong and Haghtalab, Nika and Steinhardt, Jacob},
  booktitle = {International Conference on Machine Learning},
  year      = {2024},
  pages     = {17298--17312},
  volume    = {235},
  url       = {https://mlanthology.org/icml/2024/halawi2024icml-covert/}
}