Teaching Language Models with Canonical Examples

Abstract

It is easy to write down a desirable or undesirable language model behavior (e.g., knowledge: "The capital of Mauritius is Port Louis"; or a stereotype: "Researchers are always coldhearted"), but it is difficult to make the model robustly generalize from these canonical examples. We formalize this task: a learning method takes a model and simple canonical examples and must produce a model that (1) generalizes to naturalistic examples, (2) stays within a bound of the original model's loss, and (3) performs well on a "hard negative" distribution that tests for overgeneralization. We build on the Backpack language model, whose predictions take the form of a sparse weighted sum over a very large bank of sense vectors. We select and finetune a few Backpack senses per canonical example and find that this substantially outperforms other training methods. The Backpack we work with has only 170M parameters, yet we find that it can improve much larger models: a product-of-experts ensemble that combines the 35x-larger GPT-J-6B with the ratio of the finetuned to the pretrained Backpack outperforms finetuning GPT-J itself.
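
The abstract's two key mechanisms can be sketched concretely. First, sense finetuning: select a small number of the Backpack's sense vectors and update only those on the canonical examples. The sketch below is a minimal illustration, assuming a hypothetical `model.sense_vectors` parameter of shape `[vocab, n_senses, dim]` and using gradient magnitude as the selection score; the paper's exact selection criterion and model API may differ.

```python
import torch

def select_and_finetune_senses(model, loss_fn, canonical_batch, k=6, steps=10, lr=1e-3):
    """Hedged sketch of sense finetuning: score every sense vector on the
    canonical examples, keep the top k, and update only those vectors.
    `model.sense_vectors` is a hypothetical interface, not the paper's code."""
    senses = model.sense_vectors  # assumed: a leaf Parameter [vocab, n_senses, dim]
    loss = loss_fn(model, canonical_batch)
    (grad,) = torch.autograd.grad(loss, senses)
    scores = grad.norm(dim=-1)                   # one score per (word, sense) pair
    mask = torch.zeros_like(scores, dtype=torch.bool).flatten()
    mask[scores.flatten().topk(k).indices] = True
    mask = mask.view_as(scores)

    opt = torch.optim.Adam([senses], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model, canonical_batch).backward()
        senses.grad[~mask] = 0.0                 # only the k selected senses move
        opt.step()
```

Second, the product-of-experts ensemble from the last sentence: GPT-J's next-token distribution is multiplied by the ratio of the finetuned to the pretrained Backpack's distributions, which in logit space is just adding a log-probability difference and renormalizing. A minimal sketch, assuming all three models share a vocabulary so their logits align:

```python
import torch.nn.functional as F

def poe_logprobs(gptj_logits, ft_backpack_logits, pt_backpack_logits):
    """p(w) proportional to p_GPT-J(w) * p_finetuned(w) / p_pretrained(w),
    computed as a log-prob sum/difference and renormalized over the vocabulary."""
    combined = (
        F.log_softmax(gptj_logits, dim=-1)
        + F.log_softmax(ft_backpack_logits, dim=-1)
        - F.log_softmax(pt_backpack_logits, dim=-1)
    )
    return F.log_softmax(combined, dim=-1)
```

This keeps GPT-J frozen: only the small Backpack is finetuned, and its before/after ratio steers the larger model's predictions.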

Cite

Text

Hewitt et al. "Teaching Language Models with Canonical Examples." NeurIPS 2023 Workshops: R0-FoMo, 2023.

Markdown

[Hewitt et al. "Teaching Language Models with Canonical Examples." NeurIPS 2023 Workshops: R0-FoMo, 2023.](https://mlanthology.org/neuripsw/2023/hewitt2023neuripsw-teaching/)

BibTeX

@inproceedings{hewitt2023neuripsw-teaching,
  title     = {{Teaching Language Models with Canonical Examples}},
  author    = {Hewitt, John and Chen, Sarah Li and Liang, Percy and Manning, Christopher D.},
  booktitle = {NeurIPS 2023 Workshops: R0-FoMo},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/hewitt2023neuripsw-teaching/}
}