Lisa: Lazy Safety Alignment for Large Language Models Against Harmful Fine-Tuning Attack
Abstract
Recent studies show that Large Language Models (LLMs) with safety alignment can be jail-broken by fine-tuning on a dataset mixed with harmful data. For the first time in the literature, we show that the jail-break effect can be mitigated by separating two states in the fine-tuning stage to respectively optimize over the alignment and user datasets. Unfortunately, our subsequent study shows that this simple Bi-State Optimization (BSO) solution experiences convergence instability when the number of steps invested in its alignment state is too small, leading to downgraded alignment performance. Through statistical analysis, we show that excess drift towards the switching iterates of the two states is a probable cause of the instability. To remedy this issue, we propose Lazy safety alignment (Lisa), which introduces a proximal term to constrain the drift of each state. Theoretically, the benefit of the proximal term is supported by our convergence analysis, which shows that a sufficiently large proximal factor is necessary to guarantee Lisa's convergence. Empirically, our results on four downstream fine-tuning tasks show that Lisa with a proximal term can significantly increase alignment performance while maintaining the LLM's accuracy on the user tasks. Code is available at https://github.com/git-disl/Lisa.
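To make the two-state procedure concrete, below is a minimal PyTorch-style sketch of alternating between an alignment state and a user fine-tuning state, with a proximal penalty anchored at the switching iterate. All names here (lisa_finetune, align_loader, user_loader, rho, and the HuggingFace-style model(**batch).loss) are illustrative assumptions rather than the authors' released implementation; see the linked repository for the official code.

```python
import torch


def lisa_finetune(model, align_loader, user_loader, optimizer,
                  align_steps=100, user_steps=900, rho=1.0, rounds=5):
    """Sketch of a bi-state schedule with a proximal term.

    In each round, the model first takes `align_steps` steps on the alignment
    data and then `user_steps` steps on the user data. Within each state, a
    proximal penalty (strength `rho`) discourages drift away from the
    parameters snapshotted at the most recent state switch.
    """
    for _ in range(rounds):
        for loader, steps in ((align_loader, align_steps), (user_loader, user_steps)):
            # Snapshot the switching iterate; the proximal term anchors to it.
            anchor = [p.detach().clone() for p in model.parameters()]
            data_iter = iter(loader)
            for _ in range(steps):
                try:
                    batch = next(data_iter)
                except StopIteration:
                    data_iter = iter(loader)  # restart the loader if exhausted
                    batch = next(data_iter)
                loss = model(**batch).loss
                # Proximal penalty: (rho / 2) * ||theta - anchor||^2
                prox = sum((p - a).pow(2).sum()
                           for p, a in zip(model.parameters(), anchor))
                (loss + 0.5 * rho * prox).backward()
                optimizer.step()
                optimizer.zero_grad()
    return model
```

With rho = 0 this reduces to plain bi-state optimization; a sufficiently large rho is what the paper's analysis identifies as necessary for stable convergence.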
Cite
Text
Huang et al. "Lisa: Lazy Safety Alignment for Large Language Models Against Harmful Fine-Tuning Attack." Neural Information Processing Systems, 2024. doi:10.52202/079017-3320
Markdown
[Huang et al. "Lisa: Lazy Safety Alignment for Large Language Models Against Harmful Fine-Tuning Attack." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/huang2024neurips-lisa/) doi:10.52202/079017-3320
BibTeX
@inproceedings{huang2024neurips-lisa,
title = {{Lisa: Lazy Safety Alignment for Large Language Models Against Harmful Fine-Tuning Attack}},
author = {Huang, Tiansheng and Hu, Sihao and Ilhan, Fatih and Tekin, Selim Furkan and Liu, Ling},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-3320},
url = {https://mlanthology.org/neurips/2024/huang2024neurips-lisa/}
}