Language Model Scaling Laws and Zero-Sum Learning
Abstract
This work aims to understand how, in terms of training dynamics, scaling up language model size yields predictable loss improvements. We find that these improvements can be tied back to loss deceleration: an abrupt transition in the rate of loss improvement, characterized by piece-wise linear behavior in log-log space. Notably, improvements from increased model size appear to result from (1) lowering the loss at which this transition occurs; and (2) improving the rate of loss improvement after this transition. As an explanation for the mechanism underlying this transition (and the effect of model size on loss that it mediates), we propose the zero-sum learning (ZSL) hypothesis. In ZSL, per-token gradients become systematically opposed, leading to degenerate training dynamics in which the model cannot improve loss on one token without harming it on another, bottlenecking the overall rate at which loss can improve. We find compelling evidence of ZSL, as well as unexpected results that shed light on other factors contributing to ZSL.
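For concreteness, the piece-wise linear picture can be sketched as a broken power law in training steps. The notation below is ours and purely illustrative, not taken from the paper: write t for the training step, t_d and L_d for the step and loss at which deceleration occurs, and alpha_0 > alpha_1 for the log-log slopes before and after the transition. Then

\[
\log L(t) \;\approx\; \log L_d \;-\;
\begin{cases}
\alpha_0 \,(\log t - \log t_d), & t < t_d,\\
\alpha_1 \,(\log t - \log t_d), & t \ge t_d,
\end{cases}
\qquad \alpha_1 < \alpha_0.
\]

Under this parameterization, the two effects of scale described above amount to decreasing L_d and increasing alpha_1.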
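A minimal sketch of how per-token gradient opposition could be probed, assuming a toy PyTorch setup (the linear stand-in model, the tensor shapes, and the cosine-similarity diagnostic are illustrative choices of ours, not the paper's experimental protocol):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, hidden_dim, seq_len = 50, 16, 8

# Toy stand-in for a language model head: per-token hidden states -> next-token logits.
model = torch.nn.Linear(hidden_dim, vocab_size)
hidden = torch.randn(seq_len, hidden_dim)        # hypothetical per-token hidden states
targets = torch.randint(0, vocab_size, (seq_len,))

# One gradient vector per token, obtained by backpropagating each token's loss separately.
per_token_grads = []
for i in range(seq_len):
    model.zero_grad()
    loss_i = F.cross_entropy(model(hidden[i : i + 1]), targets[i : i + 1])
    loss_i.backward()
    grad_i = torch.cat([p.grad.flatten() for p in model.parameters()])
    per_token_grads.append(grad_i.clone())
grads = torch.stack(per_token_grads)             # shape: (seq_len, n_params)

# Pairwise cosine similarity between per-token gradients. Systematically negative
# off-diagonal values would indicate the opposition posited by ZSL: a step that
# lowers one token's loss tends to raise another's.
unit = F.normalize(grads, dim=1)
cos = unit @ unit.T
off_diag = cos[~torch.eye(seq_len, dtype=torch.bool)]
print(f"mean off-diagonal cosine similarity: {off_diag.mean().item():.4f}")

With a real language model one would use actual hidden states and held-out sequences; the sketch only illustrates the diagnostic itself, namely that systematically negative similarity between per-token gradients means loss cannot improve on one token without worsening on another.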
Cite
Text
Mircea et al. "Language Model Scaling Laws and Zero-Sum Learning." NeurIPS 2024 Workshops: SciForDL, 2024.
Markdown
[Mircea et al. "Language Model Scaling Laws and Zero-Sum Learning." NeurIPS 2024 Workshops: SciForDL, 2024.](https://mlanthology.org/neuripsw/2024/mircea2024neuripsw-language/)
BibTeX
@inproceedings{mircea2024neuripsw-language,
  title = {{Language Model Scaling Laws and Zero-Sum Learning}},
  author = {Mircea, Andrei and Lobacheva, Ekaterina and Chakraborty, Supriyo and Chitsazan, Nima and Rish, Irina},
  booktitle = {NeurIPS 2024 Workshops: SciForDL},
  year = {2024},
  url = {https://mlanthology.org/neuripsw/2024/mircea2024neuripsw-language/}
}