Robust LLM Unlearning via Post Judgment and Multi-Round Thinking

Abstract

The unlearning capability of LLMs is vital for ensuring compliance and safety, especially when removing sensitive knowledge from deployed models. Pre-filtering methods, enabling rapid deployment without parameter changes, are a prominent unlearning approach. However, they exhibit significant robustness deficiencies against adversarial attacks: in the worst case, simple prefix attacks can induce up to a 1,150-fold surge in information leakage for fictitious entity knowledge, while composite question attacks can cause accuracy on hazardous knowledge to rebound from the 24.9% random-guess baseline to as high as 67.0%. To address this, we propose a new unlearning framework via post judgment and multi-round thinking (PoRT), which consists of three key modules. First, a data cleaning module compiles a dynamic few-shot prompt that instructs the LLM to simultaneously generate both a cleaned version of the user's query and a corresponding initial response, supported by an extensible demonstration library for adaptive defense. Second, unlike existing pre-filtering methods that typically judge based solely on prompts, our post-judgment module jointly evaluates cleaned prompts and their corresponding responses to better detect non-compliant outputs. Finally, a selective multi-round thinking process is employed to trigger LLM's self-correction for low-confidence outputs, enhancing reliability and result quality. Extensive experiments on benchmarks demonstrate PoRT's superior robustness against adversarial attacks and strong unlearning effectiveness without compromising general model utility. Code is available at https://github.com/ChnIRuI/PoRT_LLM_Unlearning

Cite

Text

Chen et al. "Robust LLM Unlearning via Post Judgment and Multi-Round Thinking." International Conference on Learning Representations, 2026.

Markdown

[Chen et al. "Robust LLM Unlearning via Post Judgment and Multi-Round Thinking." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/chen2026iclr-robust/)

BibTeX

@inproceedings{chen2026iclr-robust,
  title     = {{Robust LLM Unlearning via Post Judgment and Multi-Round Thinking}},
  author    = {Chen, Xinrui and Cao, Xu and Zhang, Jianhao and Zhao, Pinlong and Gao, Di and Wu, Ou},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/chen2026iclr-robust/}
}