NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning
Abstract
The emergence of fine-tuning-as-a-service has revealed a new vulnerability in large language models (LLMs). A mere handful of malicious samples uploaded by users can subtly manipulate the fine-tuning process, leading to a compromised alignment state. Existing methods to counteract fine-tuning attacks typically require substantial computational resources; even with parameter-efficient techniques like LoRA, gradient updates remain essential. To address these challenges, we propose Neuron-Level Safety Realignment (NLSR), a training-free framework that restores the safety of LLMs based on the similarity difference of safety-critical neurons before and after fine-tuning. The core of our framework is first to construct a safety reference model from an initially aligned model to amplify safety-related features in neurons. We then utilize this reference model to identify safety-critical neurons, which we prepare as patches. Finally, we selectively restore only those neurons that exhibit significant similarity differences by transplanting these prepared patches, thereby minimally altering the fine-tuned model. Extensive experiments demonstrate significant safety enhancements in fine-tuned models across multiple downstream tasks, while largely preserving task-level accuracy. Our findings indicate that safety-critical neurons exhibit significant regional variations after fine-tuning, which can be effectively corrected through neuron transplantation from the reference model without the need for additional training.
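The restoration step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function and parameter names (`realign`, `safety_idx`, `tau`) and the use of cosine similarity over per-neuron weight rows are assumptions for illustration only; NLSR's actual reference-model construction and neuron-identification procedure are not reproduced here.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two weight vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def realign(ref_weights, tuned_weights, safety_idx, tau=0.7):
    """Hypothetical neuron-level patching sketch.

    For each safety-critical neuron (a row of the weight matrix), compare
    the reference and fine-tuned weights; when similarity drops below the
    threshold tau, transplant the reference row into the fine-tuned model.
    All other neurons are left untouched, so the edit is minimal.
    """
    patched = tuned_weights.copy()
    restored = []
    for i in safety_idx:
        if cosine_sim(ref_weights[i], tuned_weights[i]) < tau:
            patched[i] = ref_weights[i]
            restored.append(i)
    return patched, restored
```

Only neurons flagged as safety-critical *and* substantially changed by fine-tuning are replaced, which mirrors the paper's goal of minimally altering the fine-tuned model while recovering alignment without any gradient updates.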
Cite
Text
Yi et al. "NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I24.34762
Markdown
[Yi et al. "NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/yi2025aaai-nlsr/) doi:10.1609/AAAI.V39I24.34762
BibTeX
@inproceedings{yi2025aaai-nlsr,
title = {{NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning}},
author = {Yi, Xin and Zheng, Shunfan and Wang, Linlin and de Melo, Gerard and Wang, Xiaoling and He, Liang},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {25706-25714},
doi = {10.1609/AAAI.V39I24.34762},
url = {https://mlanthology.org/aaai/2025/yi2025aaai-nlsr/}
}