A Statistical Framework for Weak-to-Strong Generalization

Abstract

Modern large language model (LLM) alignment techniques rely on human feedback, but it is unclear whether these techniques fundamentally limit the capabilities of aligned LLMs. In particular, it is unclear whether (stronger) LLMs with superhuman capabilities can be aligned using (weaker) human feedback *without degrading their capabilities*. This is an instance of the weak-to-strong generalization problem: using weaker (less capable) feedback to train a stronger (more capable) model. We prove that weak-to-strong generalization is possible by eliciting latent knowledge from pre-trained LLMs. Specifically, we cast weak-to-strong generalization as a transfer learning problem in which we wish to transfer a latent concept from a weak model to a strong pre-trained model. We prove that a naive fine-tuning approach suffers from fundamental limitations, but that an alternative refinement-based approach suggested by the problem structure provably overcomes them. Finally, we demonstrate the practical applicability of the refinement approach on a persona learning task.

Cite

Text

Somerstep et al. "A Statistical Framework for Weak-to-Strong Generalization." ICML 2024 Workshops: NextGenAISafety, 2024.

Markdown

[Somerstep et al. "A Statistical Framework for Weak-to-Strong Generalization." ICML 2024 Workshops: NextGenAISafety, 2024.](https://mlanthology.org/icmlw/2024/somerstep2024icmlw-statistical/)

BibTeX

@inproceedings{somerstep2024icmlw-statistical,
  title     = {{A Statistical Framework for Weak-to-Strong Generalization}},
  author    = {Somerstep, Seamus and Polo, Felipe Maia and Banerjee, Moulinath and Ritov, Yaacov and Yurochkin, Mikhail and Sun, Yuekai},
  booktitle = {ICML 2024 Workshops: NextGenAISafety},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/somerstep2024icmlw-statistical/}
}