From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization

Abstract

Recent unlearning methods for LLMs are vulnerable to relearning attacks: knowledge believed to be unlearned re-emerges after fine-tuning on a small set of (even seemingly unrelated) examples. We study this phenomenon in a controlled setting for example-level unlearning in vision classifiers. We make the surprising discovery that forget-set accuracy can recover from around 50% post-unlearning to nearly 100% with fine-tuning on just the *retain* set, i.e., zero examples of the forget set. We observe this effect across a wide variety of unlearning methods, whereas for a model retrained from scratch excluding the forget set (the gold standard), accuracy remains at 50%. We find that resistance to relearning attacks can be predicted by weight-space properties, specifically the $L_2$ distance and linear mode connectivity between the original and unlearned models. Leveraging this insight, we propose a new class of methods that achieve state-of-the-art resistance to relearning attacks.
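
The two weight-space quantities named in the abstract can be probed directly on a pair of models with the same architecture. The sketch below is a minimal, hypothetical PyTorch illustration (not the authors' code): it computes the $L_2$ distance between the flattened parameter vectors of the original and unlearned models, and the loss along the straight line between their weights, which is the standard probe for linear mode connectivity. The `make_model`, `loader`, and `loss_fn` arguments are placeholders for a user-supplied architecture, evaluation data, and loss.

```python
# Illustrative sketch only -- assumes two models sharing one architecture.
import torch
import torch.nn as nn


def l2_weight_distance(model_a: nn.Module, model_b: nn.Module) -> float:
    """L2 distance between the flattened parameter vectors of two models."""
    vec_a = torch.cat([p.detach().flatten() for p in model_a.parameters()])
    vec_b = torch.cat([p.detach().flatten() for p in model_b.parameters()])
    return torch.linalg.vector_norm(vec_a - vec_b).item()


@torch.no_grad()
def interpolation_losses(model_a, model_b, make_model, loader, loss_fn, steps=11):
    """Loss at evenly spaced points on the line between the two weight vectors.

    A flat, low-loss path (no barrier) indicates linear mode connectivity;
    a pronounced loss barrier suggests the models sit in different basins.
    `make_model` builds a fresh module with the shared architecture.
    """
    losses = []
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    for i in range(steps):
        alpha = i / (steps - 1)
        # Blend only floating-point tensors; keep integer buffers from model_a.
        blended = {
            k: (1 - alpha) * state_a[k] + alpha * state_b[k]
            if state_a[k].is_floating_point() else state_a[k]
            for k in state_a
        }
        probe = make_model()
        probe.load_state_dict(blended)
        probe.eval()
        total, count = 0.0, 0
        for x, y in loader:
            total += loss_fn(probe(x), y).item() * len(y)
            count += len(y)
        losses.append(total / max(count, 1))
    return losses
```

Under the abstract's finding, an unlearned model that stays close to the original in $L_2$ distance and remains linearly connected to it (a flat interpolation-loss curve) would be expected to be more susceptible to relearning attacks than one that moves to a different basin.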

Cite

Text

Siddiqui et al. "From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization." Advances in Neural Information Processing Systems, 2025.

Markdown

[Siddiqui et al. "From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/siddiqui2025neurips-dormant/)

BibTeX

@inproceedings{siddiqui2025neurips-dormant,
  title     = {{From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization}},
  author    = {Siddiqui, Shoaib Ahmed and Weller, Adrian and Krueger, David and Dziugaite, Gintare Karolina and Mozer, Michael Curtis and Triantafillou, Eleni},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/siddiqui2025neurips-dormant/}
}