Scaling Laws for Gradient Descent and Sign Descent for Linear Bigram Models Under Zipf’s Law

Abstract

Recent works have highlighted the optimization difficulties encountered by gradient descent when training the first and last layers of transformer-based language models, difficulties that are overcome by optimizers such as Adam. The problem appears linked to the heavy-tailed distribution of words in text data, where the frequency of the $k$th most frequent word $\pi_k$ is proportional to $1/k$, following Zipf's law. To better understand the impact of the data distribution on training performance, we study a linear bigram model for next-token prediction when the tokens follow a power law $\pi_k \propto 1/k^\alpha$ parameterized by the exponent $\alpha$. We derive optimization scaling laws for deterministic gradient descent and for sign descent, a proxy for Adam, as a function of the exponent $\alpha \geq 0$. This setting differs from existing theoretical investigations of scaling laws, which assume that the eigenvalues of the data decay as a power law with exponent $\alpha > 1$. This assumption effectively makes the problem "finite dimensional", as most of the loss comes from a few of the largest eigencomponents. In comparison, we show that the problem is more difficult when the data have heavier tails. The case $\alpha = 1$ found in text is the "worst case" for gradient descent, in that the number of iterations required to reach a small relative error scales almost linearly with the dimension. While the performance of sign descent also depends on the dimension, for Zipf-distributed data the number of iterations scales only with the square root of the dimension, leading to a large improvement over gradient descent for large vocabularies.
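
To make the setting concrete, here is a minimal numerical sketch, not the paper's exact model or analysis: assuming a squared loss and one-hot token inputs, the population objective of a linear bigram model reduces (per output coordinate) to a diagonal quadratic whose curvatures are the token frequencies $\pi_k \propto 1/k^\alpha$, on which full-batch gradient descent can be compared with sign descent. The dimension, iteration count, step sizes, and target parameters below are illustrative choices.

# Minimal sketch (assumptions: squared loss, one-hot inputs, hand-picked step
# sizes); compares full-batch gradient descent and sign descent on a diagonal
# quadratic whose curvatures follow the token frequencies pi_k ∝ 1/k^alpha.
import numpy as np

d, alpha, iters = 10_000, 1.0, 1_000       # vocabulary size, Zipf exponent, steps
pi = 1.0 / np.arange(1, d + 1) ** alpha
pi /= pi.sum()                             # token frequencies, pi_k ∝ 1/k^alpha

x_star = np.ones(d)                        # arbitrary target parameters
loss = lambda x: 0.5 * np.sum(pi * (x - x_star) ** 2)

def run(direction, lr):
    """Fixed-step descent along direction(grad), starting from zero."""
    x = np.zeros(d)
    for _ in range(iters):
        grad = pi * (x - x_star)           # gradient of the diagonal quadratic
        x -= lr * direction(grad)
    return loss(x) / loss(np.zeros(d))     # relative error

rel_gd = run(lambda g: g, lr=1.0 / pi.max())   # gradient descent, step 1/(max curvature)
rel_sign = run(np.sign, lr=2e-3)               # sign descent, small fixed step
print(f"relative error after {iters} steps: GD {rel_gd:.1e}, sign {rel_sign:.1e}")

With the Zipf exponent $\alpha = 1$, the low-frequency (small $\pi_k$) coordinates dominate the remaining loss for gradient descent, while sign descent moves every coordinate at the same rate regardless of its frequency, which is the qualitative gap the scaling laws quantify.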

Cite

Text

Kunstner and Bach. "Scaling Laws for Gradient Descent and Sign Descent for Linear Bigram Models Under Zipf’s Law." Advances in Neural Information Processing Systems, 2025.

Markdown

[Kunstner and Bach. "Scaling Laws for Gradient Descent and Sign Descent for Linear Bigram Models Under Zipf’s Law." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/kunstner2025neurips-scaling/)

BibTeX

@inproceedings{kunstner2025neurips-scaling,
  title     = {{Scaling Laws for Gradient Descent and Sign Descent for Linear Bigram Models Under Zipf’s Law}},
  author    = {Kunstner, Frederik and Bach, Francis},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/kunstner2025neurips-scaling/}
}