Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks
Abstract
Studies of scaling ladders have shown that the compute-optimal Pareto frontier of a family of loss curves can have a predictable shape, often a power law. We use a series of small transformer models to demonstrate that the full loss curves themselves have a consistent shape, collapsing onto a single universal curve after an affine rescaling. Surprisingly, the deviations in the rescaled curves across model sizes are smaller than the deviations induced by random initialization and data ordering in the raw loss curves, a phenomenon we call supercollapse. We recreate this phenomenon in a simplified setting of training MLPs on a synthetic regression dataset. By analyzing both the original model and our simplified model, we identify necessary conditions for supercollapse, including compute-optimal training, learning rate decay, and a power-law compute-loss Pareto frontier, and demonstrate its sensitivity to the estimate of the irreducible loss. Our study hints at a broader, dynamical universality induced by compute-optimal scaling procedures.
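As a rough illustration of the rescaling described above, the sketch below generates synthetic power-law loss curves for several hypothetical compute budgets and applies one plausible affine rescaling: subtract an estimated irreducible loss, normalize by the final excess loss, and measure progress as a fraction of total compute. The exponent, constants, and irreducible-loss estimate `L0` are assumptions chosen for illustration, not the authors' actual setup.

```python
# Minimal sketch (not the authors' code): an affine rescaling that collapses
# synthetic power-law loss curves onto one universal curve.
import numpy as np

L0 = 1.5  # assumed estimate of the irreducible loss

def loss_curve(compute, scale=2.0, exponent=0.3):
    """Synthetic loss curve: irreducible loss plus a power law in compute."""
    return L0 + scale * compute ** -exponent

compute_budgets = [1e3, 1e4, 1e5]  # hypothetical compute-optimal budgets
for C in compute_budgets:
    frac = np.linspace(0.01, 1.0, 200)        # fraction of total compute spent
    loss = loss_curve(frac * C)
    # Affine rescaling: subtract the irreducible loss and divide by the
    # final excess loss, so every rescaled curve ends at 1.0.
    rescaled = (loss - L0) / (loss[-1] - L0)
    # For these synthetic curves the rescaled values are identical across
    # budgets: (frac * C)**-0.3 / C**-0.3 == frac**-0.3 for every C.
    print(f"C={C:.0e}  first rescaled values: {rescaled[:3]}")
```

In this toy setting the collapse is exact by construction; the paper's point is that real transformer loss curves, after an analogous rescaling under compute-optimal training, collapse to within (and below) seed-level noise.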
Cite
Text
Qiu et al. "Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks." NeurIPS 2024 Workshops: OPT, 2024.
Markdown
[Qiu et al. "Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks." NeurIPS 2024 Workshops: OPT, 2024.](https://mlanthology.org/neuripsw/2024/qiu2024neuripsw-scaling/)
BibTeX
@inproceedings{qiu2024neuripsw-scaling,
  title = {{Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks}},
  author = {Qiu, Shikai and Agarwala, Atish and Pennington, Jeffrey and Xiao, Lechao},
  booktitle = {NeurIPS 2024 Workshops: OPT},
  year = {2024},
  url = {https://mlanthology.org/neuripsw/2024/qiu2024neuripsw-scaling/}
}