Distributional Scaling Laws for Emergent Capabilities
Abstract
In this paper, we explore the nature of sudden breakthroughs in language model performance at scale, which stand in contrast to the smooth improvements governed by scaling laws. While advocates of "emergence" argue that abrupt performance gains arise from the acquisition of new capabilities at specific scales, recent work has suggested that these breakthroughs are illusions caused by thresholding effects. We propose an alternative explanation: that breakthroughs are driven by random variation, specifically by multimodal performance distributions across random seeds. Using a length generalization task as a case study, we show that different random seeds can produce either highly linear or emergent scaling behavior. We further demonstrate that the probability of a model acquiring a breakthrough capability increases continuously with scale, despite apparent discontinuities in performance. Additionally, we find that scaling models in width versus depth has distinct effects: depth affects the likelihood of sampling from a successful distribution, while width improves the average performance of successful models. These findings suggest that random variation plays an important role in scaling and in the emergent capabilities of LMs.
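To make the distributional picture concrete, the toy simulation below (not the authors' code; all functional forms, scales, and parameter values are illustrative assumptions) draws per-seed accuracies from a two-component mixture whose success probability rises smoothly with model size. The mean over seeds changes gradually, while any individual seed can appear to jump abruptly from the failed mode to the successful mode.

```python
# Minimal sketch of a multimodal per-seed performance distribution.
# Assumption: per-seed success probability is logistic in log(parameter count);
# successful and failed runs cluster around separate accuracy modes.
import numpy as np

rng = np.random.default_rng(0)

scales = np.logspace(6, 9, 10)   # hypothetical parameter counts
n_seeds = 50                     # random seeds per scale

def p_success(n_params):
    """Assumed smooth probability that a seed lands in the 'successful' mode."""
    return 1.0 / (1.0 + np.exp(-2.0 * (np.log10(n_params) - 7.5)))

for n in scales:
    # Each seed is a draw from a two-component mixture:
    # failed runs cluster near chance, successful runs near ceiling.
    success = rng.random(n_seeds) < p_success(n)
    acc = np.where(success,
                   rng.normal(0.9, 0.05, n_seeds),   # successful mode
                   rng.normal(0.1, 0.05, n_seeds))   # failed mode
    print(f"N={n:.1e}  P(success)={p_success(n):.2f}  "
          f"mean acc={acc.mean():.2f}  "
          f"per-seed range=({acc.min():.2f}, {acc.max():.2f})")
```

Under this framing, the width-versus-depth distinction in the abstract could be sketched by letting depth shift the success probability and width shift the mean of the successful mode, though the specific parameterization here is purely illustrative.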
Cite
Text
Zhao et al. "Distributional Scaling Laws for Emergent Capabilities." NeurIPS 2024 Workshops: SciForDL, 2024.

Markdown
[Zhao et al. "Distributional Scaling Laws for Emergent Capabilities." NeurIPS 2024 Workshops: SciForDL, 2024.](https://mlanthology.org/neuripsw/2024/zhao2024neuripsw-distributional/)

BibTeX
@inproceedings{zhao2024neuripsw-distributional,
title = {{Distributional Scaling Laws for Emergent Capabilities}},
author = {Zhao, Rosie and Saphra, Naomi and Kakade, Sham M.},
booktitle = {NeurIPS 2024 Workshops: SciForDL},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/zhao2024neuripsw-distributional/}
}