Learning-Time Encoding Shapes Unlearning in LLMs
Abstract
As large language models (LLMs) are increasingly deployed in the real world, the ability to "unlearn", or remove specific pieces of knowledge post hoc, has become essential for a variety of reasons ranging from privacy regulations to correcting outdated or harmful content. Prior work has proposed unlearning benchmarks and algorithms, and has typically assumed that the training process and the target model are fixed. In this work, we empirically investigate how learning-time choices in knowledge encoding impact the effectiveness of unlearning factual knowledge. We conduct two studies: (i) examining how paraphrased descriptions influence unlearning performance, and (ii) analyzing unlearning when multiple facts are embedded within the same training text chunk. Our empirical study reveals two important implications: a new perspective for interpreting unlearning performance and practical strategies for improving LLM unlearning.
Cite
Text
Wu et al. "Learning-Time Encoding Shapes Unlearning in LLMs." International Conference on Learning Representations, 2026.
Markdown
[Wu et al. "Learning-Time Encoding Shapes Unlearning in LLMs." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wu2026iclr-learningtime/)
BibTeX
@inproceedings{wu2026iclr-learningtime,
title = {{Learning-Time Encoding Shapes Unlearning in LLMs}},
author = {Wu, Ruihan and Garov, Konstantin and Chaudhuri, Kamalika},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/wu2026iclr-learningtime/}
}