Scaling Laws from the Data Manifold Dimension

Abstract

When data is plentiful, the test loss achieved by well-trained neural networks scales as a power-law $L \propto N^{-\alpha}$ in the number of network parameters $N$. This empirical scaling law holds for a wide variety of data modalities, and may persist over many orders of magnitude. The scaling law can be explained if neural models are effectively just performing regression on a data manifold of intrinsic dimension $d$. This simple theory predicts that the scaling exponent satisfies $\alpha \approx 4/d$ for cross-entropy and mean-squared error losses. We confirm the theory by independently measuring the intrinsic dimension and the scaling exponents in a teacher/student framework, where we can study a variety of $d$ and $\alpha$ by dialing the properties of random teacher networks. We also test the theory with CNN image classifiers on several datasets and with GPT-type language models.
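The predicted relation between the scaling exponent and the manifold dimension can be illustrated with a minimal sketch (not from the paper's code): given loss measurements following $L \propto N^{-\alpha}$, the exponent $\alpha$ is recoverable as the negative slope of a log-log linear fit, and the theory predicts it should come out near $4/d$. The constants and the hypothetical dimension $d = 16$ below are illustrative assumptions.

```python
import numpy as np

def fit_scaling_exponent(N, L):
    """Estimate alpha from L ∝ N^{-alpha} via a log-log linear fit."""
    slope, _ = np.polyfit(np.log(N), np.log(L), 1)
    return -slope

d = 16                       # hypothetical intrinsic dimension of the data manifold
N = np.logspace(3, 7, 20)    # parameter counts spanning four orders of magnitude
L = 2.5 * N ** (-4.0 / d)    # idealized loss curve with exponent alpha = 4/d

alpha = fit_scaling_exponent(N, L)
print(round(alpha, 4))       # ≈ 4/d = 0.25
```

In practice one would fit measured test losses from a sweep of model sizes; the paper's tests compare the exponent fitted this way against an independently measured intrinsic dimension.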

Cite

Text

Sharma and Kaplan. "Scaling Laws from the Data Manifold Dimension." Journal of Machine Learning Research, 2022.

Markdown

[Sharma and Kaplan. "Scaling Laws from the Data Manifold Dimension." Journal of Machine Learning Research, 2022.](https://mlanthology.org/jmlr/2022/sharma2022jmlr-scaling/)

BibTeX

@article{sharma2022jmlr-scaling,
  title     = {{Scaling Laws from the Data Manifold Dimension}},
  author    = {Sharma, Utkarsh and Kaplan, Jared},
  journal   = {Journal of Machine Learning Research},
  year      = {2022},
  pages     = {1--34},
  volume    = {23},
  url       = {https://mlanthology.org/jmlr/2022/sharma2022jmlr-scaling/}
}