RULE: Reinforcement UnLEarning Achieves Forget-Retain Pareto Optimality
Abstract
The widespread deployment of Large Language Models (LLMs) trained on massive, uncurated corpora has raised growing concerns about the inclusion of sensitive, copyrighted, or illegal content. This has led to increasing interest in LLM unlearning: the task of selectively removing specific information from a model without retraining from scratch or degrading overall utility. However, existing methods often rely on large-scale forget and retain datasets, and suffer from unnatural responses, poor generalization, or catastrophic utility loss. In this work, we propose $\textbf{R}$einforcement $\textbf{U}$n$\textbf{LE}$arning ($\textbf{RULE}$), an efficient framework that formulates unlearning as a refusal boundary optimization problem. RULE is trained with a small portion of the forget set and synthesized boundary queries, using a verifiable reward function that encourages safe refusal on forget-related queries while preserving helpful responses on permissible inputs. We provide both theoretical and empirical evidence demonstrating the effectiveness of RULE in achieving targeted unlearning without compromising model utility. Experimental results show that, with only 12\% of the forget set and 8\% synthesized boundary data, RULE outperforms existing baselines by up to $17.4\%$ in forget quality and $16.3\%$ in response naturalness while maintaining general utility, achieving $\textit{forget-retain Pareto Optimality}$. Remarkably, we further observe that RULE improves the $\textit{naturalness}$ of model outputs, enhances training $\textit{efficiency}$, and exhibits strong $\textit{generalization}$, extending refusal behavior to semantically related but unseen queries.
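The abstract describes a verifiable reward that favors refusals on forget-related queries and helpful answers on permissible ones. The sketch below is an illustrative assumption, not the paper's implementation: the keyword-based refusal check and the binary reward values are stand-ins for whatever verifier and reward shaping RULE actually uses.

```python
# Minimal sketch (assumed, not from the paper) of a verifiable refusal-boundary
# reward: +1 when the model's behavior matches the query type, 0 otherwise.

REFUSAL_MARKERS = ("i cannot", "i can't", "i'm unable", "cannot help with")

def is_refusal(response: str) -> bool:
    """Crude keyword-based refusal check; a stand-in for a real verifier."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def refusal_boundary_reward(query_is_forget: bool, response: str) -> float:
    """Binary verifiable reward.

    - forget-related query  -> reward a safe refusal
    - permissible query     -> reward a non-refusal (helpful) response
    """
    refused = is_refusal(response)
    if query_is_forget:
        return 1.0 if refused else 0.0
    return 1.0 if not refused else 0.0

# Example: a refusal on a forget-related query earns full reward,
# as does a helpful answer on a permissible query.
print(refusal_boundary_reward(True, "I cannot share that information."))  # 1.0
print(refusal_boundary_reward(False, "Sure, here is how it works..."))    # 1.0
```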
Cite
Text
Zhang et al. "RULE: Reinforcement UnLEarning Achieves Forget-Retain Pareto Optimality." Advances in Neural Information Processing Systems, 2025.
Markdown
[Zhang et al. "RULE: Reinforcement UnLEarning Achieves Forget-Retain Pareto Optimality." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/zhang2025neurips-rule/)
BibTeX
@inproceedings{zhang2025neurips-rule,
title = {{RULE: Reinforcement UnLEarning Achieves Forget-Retain Pareto Optimality}},
author = {Zhang, Chenlong and Jin, Zhuoran and Yuan, Hongbang and Wei, Jiaheng and Zhou, Tong and Liu, Kang and Zhao, Jun and Chen, Yubo},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/zhang2025neurips-rule/}
}