Self-Supervised Alignment with Mutual Information: Learning to Follow Principles Without Preference Labels

Abstract

When prompting a language model (LM), users often expect the model to adhere to a set of behavioral principles across diverse tasks, such as producing insightful content while avoiding harmful or biased language. Instilling such principles (i.e., a constitution) into a model is resource-intensive, technically challenging, and generally requires human preference labels or examples. We introduce SAMI, an iterative algorithm that finetunes a pretrained language model (without requiring preference labels or demonstrations) to increase the conditional mutual information between constitutions and self-generated responses given queries from a dataset. On single-turn dialogue and summarization, a SAMI-trained mistral-7b outperforms the initial pretrained model, with win rates between 66% and 77%. Strikingly, it also surpasses an instruction-finetuned baseline (mistral-7b-instruct) with win rates between 55% and 57% on single-turn dialogue. SAMI requires a model that writes the principles; to avoid depending on a strong model for this step, we align a strong pretrained model (mixtral-8x7b) using constitutions written by a weak instruction-finetuned model (mistral-7b-instruct), achieving a 65% win rate on summarization. Finally, we investigate whether SAMI generalizes to diverse summarization principles (e.g., "summaries should be scientific") and scales to stronger models (llama3-70b), finding that it achieves win rates of up to 68% for learned and 67% for held-out principles compared to the base model. Our results show that a pretrained LM can learn to follow constitutions without using preference labels, demonstrations, or human oversight.
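
The abstract's core quantity, the conditional mutual information I(c; y | x) between a constitution c and a self-generated response y given a query x, can be made concrete with a minimal sketch. The PyTorch code below is illustrative rather than the authors' implementation (the names logp_matrix and sami_loss are hypothetical): it shows one standard way to increase a lower bound on such a mutual information, namely a symmetric InfoNCE-style contrastive loss over an N x N matrix of log-likelihoods log p_theta(y_j | x, c_i), which rewards the model when each constitution best explains its own response. Whether this matches the paper's exact formulation should be checked against the paper itself.

import torch
import torch.nn.functional as F

def sami_loss(logp_matrix: torch.Tensor) -> torch.Tensor:
    """Symmetric contrastive (InfoNCE-style) loss over an N x N matrix
    whose entry [i, j] is log p_theta(y_j | x, c_i): the log-likelihood
    of response j under constitution i for the same query x.

    Minimizing this loss tightens a lower bound on the conditional
    mutual information I(c; y | x), since matched pairs (i == j) must
    become more likely than mismatched ones. (Hypothetical sketch, not
    the paper's code.)
    """
    n = logp_matrix.size(0)
    targets = torch.arange(n, device=logp_matrix.device)
    # Row-wise: for each constitution c_i, its own response y_i should win.
    loss_rows = F.cross_entropy(logp_matrix, targets)
    # Column-wise: for each response y_j, its own constitution c_j should win.
    loss_cols = F.cross_entropy(logp_matrix.t(), targets)
    return 0.5 * (loss_rows + loss_cols)

# Example: a 3x3 matrix of sequence log-likelihoods (would carry gradients in training).
logp = torch.randn(3, 3)
print(sami_loss(logp))

In training, each entry of the matrix would be the summed token log-probability of response y_j under the model conditioned on (x, c_i); the gradient then pushes matched constitution-response pairs apart from mismatched ones, which is what lets the model learn to follow principles from its own generations, without preference labels.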

Cite

Text

Fränken et al. "Self-Supervised Alignment with Mutual Information: Learning to Follow Principles Without Preference Labels." Neural Information Processing Systems, 2024. doi:10.52202/079017-1961

Markdown

[Fränken et al. "Self-Supervised Alignment with Mutual Information: Learning to Follow Principles Without Preference Labels." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/franken2024neurips-selfsupervised/) doi:10.52202/079017-1961

BibTeX

@inproceedings{franken2024neurips-selfsupervised,
  title     = {{Self-Supervised Alignment with Mutual Information: Learning to Follow Principles Without Preference Labels}},
  author    = {Fränken, Jan-Philipp and Zelikman, Eric and Rafailov, Rafael and Gandhi, Kanishk and Gerstenberg, Tobias and Goodman, Noah D.},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-1961},
  url       = {https://mlanthology.org/neurips/2024/franken2024neurips-selfsupervised/}
}