An Information Theory of Compute-Optimal Size Scaling, Emergence, and Plateaus in Language Models

Abstract

Recent empirical studies reveal three phenomena as language models grow in size: compute-optimal size scaling, emergent capabilities, and performance plateauing. We present a simple unified mathematical framework that explains all of these scaling phenomena, building on recent skill-text bipartite graph frameworks for semantic learning. Modeling the learning of concepts from texts as an iterative process yields an analogy to iterative decoding of low-density parity-check (LDPC) codes in information theory. Thence, drawing on finite-size scaling characterizations of LDPC decoding, we derive the compute-optimal size scaling (Chinchilla rule) for language models. Further, using tools from random network theory, we provide a simple explanation for both the emergence of complex skills and the plateauing of performance as the size of language models scales; notably, we find multiple plateaus.
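To make the LDPC-decoding analogy concrete, here is a minimal toy sketch (not the authors' actual model) of a peeling process on a random text-skill bipartite graph: a text whose attached skills are all learned except one resolves that remaining skill, analogous to erasure peeling in LDPC decoding. The parameters (`num_texts`, `num_skills`, `degree`, `init_frac`) are illustrative assumptions, not values from the paper.

```python
import random


def simulate_peeling(num_texts=2000, num_skills=1000, degree=3, init_frac=0.1, seed=0):
    """Toy peeling process on a random text-skill bipartite graph.

    Each text node connects to `degree` random skills. A small fraction of
    skills is 'learned' initially; a text whose attached skills are all
    learned except one lets that last skill be learned (erasure-peeling rule).
    Returns the final fraction of learned skills.
    """
    rng = random.Random(seed)
    # Random bipartite graph: each text maps to a list of skill indices.
    texts = [rng.sample(range(num_skills), degree) for _ in range(num_texts)]
    learned = set(rng.sample(range(num_skills), int(init_frac * num_skills)))

    changed = True
    while changed:
        changed = False
        for skills in texts:
            unknown = [s for s in skills if s not in learned]
            if len(unknown) == 1:  # peeling condition: one unresolved skill left
                learned.add(unknown[0])
                changed = True
    return len(learned) / num_skills


if __name__ == "__main__":
    # Sweep the text-to-skill ratio to look for a sharp, emergence-like jump.
    for ratio in [0.5, 1.0, 1.5, 2.0, 3.0]:
        frac = simulate_peeling(num_texts=int(ratio * 1000))
        print(f"texts/skills = {ratio:.1f}: learned fraction = {frac:.2f}")
```

Sweeping the text-to-skill ratio in a toy model of this kind typically shows a sharp jump in the learned fraction near a threshold, which is qualitatively the sort of threshold behavior that peeling-decoding analyses characterize; the paper's derivations proceed analytically rather than by simulation.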

Cite

Text

Nayak and Varshney. "An Information Theory of Compute-Optimal Size Scaling, Emergence, and Plateaus in Language Models." NeurIPS 2024 Workshops: Compression, 2024.

Markdown

[Nayak and Varshney. "An Information Theory of Compute-Optimal Size Scaling, Emergence, and Plateaus in Language Models." NeurIPS 2024 Workshops: Compression, 2024.](https://mlanthology.org/neuripsw/2024/nayak2024neuripsw-information/)

BibTeX

@inproceedings{nayak2024neuripsw-information,
  title     = {{An Information Theory of Compute-Optimal Size Scaling, Emergence, and Plateaus in Language Models}},
  author    = {Nayak, Anuj K. and Varshney, Lav R.},
  booktitle = {NeurIPS 2024 Workshops: Compression},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/nayak2024neuripsw-information/}
}