Compute-Optimal LLMs Provably Generalize Better with Scale

Abstract

Why do larger language models generalize better? To explore this question, we develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regime, as described by the Chinchilla scaling laws. We introduce a novel, fully empirical Freedman-type martingale concentration inequality that tightens existing bounds by accounting for the variance of the loss function. The generalization bound can be broken into three contributions: the number of parameters per token, the loss variance, and the quantization error at a fixed bitrate. As language models are scaled up, the number of parameters per data point stays constant; however, both the loss variance and the quantization error decrease, implying that larger models should have *smaller* generalization gaps. We examine why larger models tend to be more quantizable from an information-theoretic perspective, showing that the rate at which they can integrate new information grows more slowly than their capacity on the compute-optimal frontier. From these findings we produce a scaling law for the generalization gap, showing that our bounds decrease in a predictable way.
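A schematic reading of the three-part decomposition described above (the notation here is illustrative rather than taken from the paper; $N$ denotes parameter count, $D$ the number of training tokens, and the bitrate refers to the quantized model):

$$
\underbrace{\mathbb{E}[\text{test loss}] - \text{train loss}}_{\text{generalization gap}}
\;\lesssim\;
\underbrace{\frac{N \cdot (\text{bits per parameter})}{D}}_{\text{parameters per token}}
\;+\;
\underbrace{\text{loss-variance term}}_{\text{Freedman-type}}
\;+\;
\underbrace{\text{quantization error}}_{\text{fixed bitrate}}
$$

On the compute-optimal (Chinchilla) frontier, $D$ grows in proportion to $N$, so the first term stays roughly constant while the variance and quantization terms shrink with scale, which is the mechanism behind the smaller generalization gaps for larger models.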

Cite

Text

Finzi et al. "Compute-Optimal LLMs Provably Generalize Better with Scale." International Conference on Learning Representations, 2025.

Markdown

[Finzi et al. "Compute-Optimal LLMs Provably Generalize Better with Scale." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/finzi2025iclr-computeoptimal/)

BibTeX

@inproceedings{finzi2025iclr-computeoptimal,
  title     = {{Compute-Optimal LLMs Provably Generalize Better with Scale}},
  author    = {Finzi, Marc Anton and Kapoor, Sanyam and Granziol, Diego and Gu, Anming and De Sa, Christopher and Kolter, J Zico and Wilson, Andrew Gordon},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/finzi2025iclr-computeoptimal/}
}