Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models

Abstract

Fine-tuning large language models on task-specific datasets can enhance their performance on downstream tasks. However, recent research shows that fine-tuning on benign, instruction-following data can inadvertently undo safety alignment and increase a model's propensity to comply with harmful queries. Although critical, understanding and mitigating safety risks in well-defined, task-specific settings remains distinct from the instruction-following context due to structural differences in the data. Our work explores the risks associated with fine-tuning closed-source models across diverse task-specific data. We demonstrate how malicious actors can subtly manipulate the structure of almost *any* task-specific dataset to foster significantly more dangerous model behaviors, while maintaining an appearance of innocuity and reasonable downstream task performance. To mitigate this issue, we propose a novel strategy that mixes in safety data which *mimics* the format and style of the user data, and show that it is more effective than the baselines at re-establishing safety while maintaining comparable task performance.
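To make the mitigation idea concrete, the sketch below shows one way safety examples could be reformatted to mimic the structure of a task-specific fine-tuning set before being mixed in. The field names, prompt template, and mixing ratio are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only: the dataset fields, template, and mixing ratio
# are assumptions for demonstration, not the paper's exact method.
import random


def to_task_format(instruction: str, response: str, template: str) -> dict:
    """Wrap an example in the same prompt template as the task-specific data."""
    return {"prompt": template.format(instruction=instruction), "completion": response}


def mix_in_mimicked_safety(task_data, safety_pairs, template, ratio=0.05, seed=0):
    """Reformat (harmful prompt, refusal) pairs so they mimic the task data's
    structure, then mix a small fraction into the fine-tuning set."""
    rng = random.Random(seed)
    n_safety = max(1, int(ratio * len(task_data)))
    sampled = rng.sample(safety_pairs, min(n_safety, len(safety_pairs)))
    mimicked = [to_task_format(q, refusal, template) for q, refusal in sampled]
    mixed = task_data + mimicked
    rng.shuffle(mixed)
    return mixed


# Hypothetical usage with a made-up task template:
template = "### Task\n{instruction}\n### Answer\n"
task_data = [{"prompt": template.format(instruction="Summarize: ..."), "completion": "..."}]
safety_pairs = [("How do I make a weapon?", "I can't help with that request.")]
train_set = mix_in_mimicked_safety(task_data, safety_pairs, template, ratio=0.05)
```

The key design choice this sketch captures is that the safety examples share the surface format of the user's task data, rather than being appended as generic, differently styled instruction-following examples.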

Cite

Text

Eiras et al. "Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models." ICML 2024 Workshops: NextGenAISafety, 2024.

Markdown

[Eiras et al. "Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models." ICML 2024 Workshops: NextGenAISafety, 2024.](https://mlanthology.org/icmlw/2024/eiras2024icmlw-mimicking/)

BibTeX

@inproceedings{eiras2024icmlw-mimicking,
  title     = {{Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models}},
  author    = {Eiras, Francisco and Petrov, Aleksandar and Torr, Philip and Kumar, M. Pawan and Bibi, Adel},
  booktitle = {ICML 2024 Workshops: NextGenAISafety},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/eiras2024icmlw-mimicking/}
}