Understanding Catastrophic Forgetting in Language Models via Implicit Inference

Abstract

We lack a systematic understanding of the effects of fine-tuning (via methods such as instruction-tuning or reinforcement learning from human feedback), particularly on tasks outside the narrow fine-tuning distribution. In a simplified scenario, we demonstrate that improving performance on fine-tuning tasks comes at the expense of other pretraining capabilities. We hypothesize that models implicitly infer the task of the prompt and that fine-tuning skews this inference towards fine-tuning tasks. We find that artificially making the task look farther from the fine-tuning distribution while requiring the same capability, a strategy we call conjugate prompting, recovers some of the pretraining capabilities in our synthetic setup. Since real fine-tuning distributions are predominantly English, we apply conjugate prompting to recover pretrained capabilities in LLMs by simply translating prompts into different languages. This allows us to recover the in-context learning abilities lost via instruction tuning and, more concerningly, the harmful content generation suppressed by safety fine-tuning in chatbots like ChatGPT.
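As a concrete illustration of the multilingual instance of conjugate prompting described above, the sketch below transforms a prompt into a language under-represented in the fine-tuning distribution, queries the model, and maps the completion back. This is a minimal sketch, not the authors' released code: the query_llm and translate callables are hypothetical placeholders for whatever model and translation system one has available.

from typing import Callable

def conjugate_prompt(
    prompt: str,
    query_llm: Callable[[str], str],       # placeholder: call into the fine-tuned LLM
    translate: Callable[[str, str], str],  # placeholder: translate(text, target_lang) -> text
    pivot_lang: str = "fr",
    source_lang: str = "en",
) -> str:
    """Conjugate prompting via translation: move the prompt away from the
    (mostly English) fine-tuning distribution while preserving the capability
    it requires, query the model, then invert the transformation."""
    transformed = translate(prompt, pivot_lang)   # prompt now looks far from the fine-tuning data
    completion = query_llm(transformed)           # implicit task inference is less skewed towards fine-tuning tasks
    return translate(completion, source_lang)     # map the answer back to the original language

Any transformation with an approximate inverse that preserves the required capability fits this template; translation is the natural choice here because real fine-tuning distributions are predominantly English.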

Cite

Text

Kotha et al. "Understanding Catastrophic Forgetting in Language Models via Implicit Inference." NeurIPS 2023 Workshops: DistShift, 2023.

Markdown

[Kotha et al. "Understanding Catastrophic Forgetting in Language Models via Implicit Inference." NeurIPS 2023 Workshops: DistShift, 2023.](https://mlanthology.org/neuripsw/2023/kotha2023neuripsw-understanding/)

BibTeX

@inproceedings{kotha2023neuripsw-understanding,
  title     = {{Understanding Catastrophic Forgetting in Language Models via Implicit Inference}},
  author    = {Kotha, Suhas and Springer, Jacob and Raghunathan, Aditi},
  booktitle = {NeurIPS 2023 Workshops: DistShift},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/kotha2023neuripsw-understanding/}
}