Curvature-Aware Safety Restoration in LLMs Fine-Tuning

Abstract

Fine-tuning Large Language Models (LLMs) for downstream tasks often compromises safety alignment, even with parameter-efficient methods such as LoRA. In this work, we uncover a notable property: fine-tuned models preserve the geometric structure of their loss landscapes with respect to harmful content, regardless of the fine-tuning method employed. This suggests that safety behaviors are not erased but shifted to less influential regions of the parameter space. Building on this insight, we propose a curvature-aware alignment restoration method that leverages influence functions and second-order optimization to selectively increase loss on harmful inputs. By navigating the geometry shared between base and fine-tuned models, our method discourages unsafe outputs while preserving task-relevant performance; it avoids full reversion and enables precise, low-impact updates. Extensive evaluations across multiple model families and adversarial settings show that our approach efficiently reduces harmful responses while maintaining or even improving utility and few-shot learning performance.
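
The idea of raising loss on harmful inputs along directions that barely affect task performance can be illustrated with a toy example. The sketch below is not the paper's algorithm: it assumes two hypothetical quadratic/linear losses in two dimensions and uses a damped inverse-Hessian (Newton-style) preconditioner, which is one common way a second-order, influence-function-like update is written. All names (`curvature_aware_step`, `A_task`, `b_harm`, `damping`) are illustrative assumptions.

```python
import numpy as np

def curvature_aware_step(theta, grad_harm, hess_task, step=0.01, damping=1e-3):
    """Ascend the harmful-input loss, preconditioned by task-loss curvature.

    Assumption: the update is theta + step * (H_task + damping*I)^{-1} g_harm,
    so movement concentrates in directions where the task loss is flat.
    """
    H = hess_task + damping * np.eye(len(theta))
    direction = np.linalg.solve(H, grad_harm)
    return theta + step * direction

# Hypothetical toy setup: the task loss is sharp along axis 0 and flat along axis 1.
A_task = np.diag([10.0, 0.1])
b_harm = np.array([1.0, 1.0])

def task_loss(t):
    return 0.5 * t @ A_task @ t   # minimized at the fine-tuned point t = 0

def harm_loss(t):
    return b_harm @ t             # loss on harmful inputs; we want it larger

theta = np.zeros(2)               # stand-in for the fine-tuned parameters

# Curvature-aware update.
new_theta = curvature_aware_step(theta, b_harm, A_task)

# Plain gradient ascent, scaled to achieve the same increase in harmful loss.
gain = harm_loss(new_theta) - harm_loss(theta)
plain_theta = theta + gain / (b_harm @ b_harm) * b_harm

print("harm-loss gain (both):", gain)
print("task-loss cost, curvature-aware:", task_loss(new_theta))
print("task-loss cost, plain gradient: ", task_loss(plain_theta))
```

In this toy setting both updates raise the harmful-input loss by the same amount, but the preconditioned step pays a smaller task-loss cost because it moves mostly along the flat direction of the task loss.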

Cite

Text

Bach et al. "Curvature-Aware Safety Restoration in LLMs Fine-Tuning." Transactions on Machine Learning Research, 2026.

Markdown

[Bach et al. "Curvature-Aware Safety Restoration in LLMs Fine-Tuning." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/bach2026tmlr-curvatureaware/)

BibTeX

@article{bach2026tmlr-curvatureaware,
  title     = {{Curvature-Aware Safety Restoration in LLMs Fine-Tuning}},
  author    = {Bach, Thong and Nguyen-Tang, Thanh and Nguyen, Dung and Le, Thao Minh and Tran, Truyen},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/bach2026tmlr-curvatureaware/}
}