A Thorough Reproduction and Evaluation of $\mu$P

Abstract

This paper is an independent empirical reproduction of the claimed benefits of the $\mu$P parametrization proposed in \citet{yang2020feature} and \citet{yang2021tuning}. Under the so-called Standard Parametrization (SP), the weights of neural networks are initialized from a Gaussian distribution whose variance scales as the inverse of the ``fan-in'', and the learning rate is the same for every layer. While this guarantees that (pre)activations are $\mathcal{O}(1)$ at initialization with respect to width, it causes their scale to become width-dependent during training. To address this, \citet{yang2020feature} and \citet{yang2021tuning} proposed the Maximal Update Parametrization ($\mu$P), which is also claimed to make the optimal values of various hyperparameters independent of width. However, despite its alleged benefits, $\mu$P has not gained much traction among practitioners, possibly because of a lack of thorough independent evaluation of $\mu$P against SP. We address this by independently reproducing the empirical claims of the original works while substantially increasing the scale of the experiments: we train $16{,}000$ neural networks ranging in size from $500$ to $1$B parameters and empirically investigate $\mu$P's effect on outputs, gradient updates, weights, training loss, and validation loss. We find that $\mu$P generally delivers on its promises, even though this does not always translate into improved generalization.
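As a brief illustration of the two parametrizations the abstract contrasts, the sketch below sets up a toy PyTorch MLP under one common presentation of the Adam variant of $\mu$P (initialization variances and per-layer learning rates as in Table 8 of \citet{yang2021tuning}), with SP noted in comments. The width, base learning rate, and layer grouping are hypothetical illustrations, not settings taken from the paper.

import math
import torch
import torch.nn as nn

def init_sp(linear: nn.Linear) -> None:
    # Standard Parametrization: every weight matrix gets variance 1/fan_in,
    # and a single global learning rate is shared by all layers.
    nn.init.normal_(linear.weight, std=math.sqrt(1.0 / linear.in_features))

def init_mup(linear: nn.Linear, kind: str) -> None:
    # muP (Adam variant): input and hidden weights keep variance 1/fan_in,
    # while the output layer shrinks to variance 1/fan_in**2.
    fan_in = linear.in_features
    std = 1.0 / fan_in if kind == "output" else math.sqrt(1.0 / fan_in)
    nn.init.normal_(linear.weight, std=std)

width, base_lr = 1024, 1e-3  # hypothetical width and base learning rate
model = nn.Sequential(nn.Linear(10, width), nn.ReLU(),
                      nn.Linear(width, width), nn.ReLU(),
                      nn.Linear(width, 1))
layers = dict(zip(("input", "hidden", "output"),
                  (m for m in model if isinstance(m, nn.Linear))))
for kind, layer in layers.items():
    init_mup(layer, kind)
    # Under SP one would instead call init_sp(layer) for every layer
    # and train everything with the single rate base_lr.

# Per-parameter Adam learning rates under muP: O(1) for input weights and
# biases, O(1/width) for hidden and output weights, so the size of each
# update stays width-independent as the model is made wider.
optimizer = torch.optim.Adam([
    {"params": [layers["input"].weight] + [l.bias for l in layers.values()],
     "lr": base_lr},
    {"params": [layers["hidden"].weight, layers["output"].weight],
     "lr": base_lr / width},
])

Because the per-layer rates in this sketch depend on the width only through the explicit 1/width factors, the base learning rate tuned at a small width can, per the claims evaluated in the paper, be reused at larger widths.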

Cite

Text

Vlassis et al. "A Thorough Reproduction and Evaluation of $\mu$P." Transactions on Machine Learning Research, 2025.

Markdown

[Vlassis et al. "A Thorough Reproduction and Evaluation of $\mu$P." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/vlassis2025tmlr-thorough/)

BibTeX

@article{vlassis2025tmlr-thorough,
  title     = {{A Thorough Reproduction and Evaluation of $\mu$P}},
  author    = {Vlassis, Georgios and Belius, David and Fomichov, Volodymyr},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/vlassis2025tmlr-thorough/}
}