U-$\mu$P: The Unit-Scaled Maximal Update Parametrization

Abstract

The Maximal Update Parametrization ($\mu$P) aims to make the optimal hyperparameters (HPs) of a model independent of its size, allowing them to be swept using a cheap proxy model rather than the full-size target model. We present a new scheme, u-$\mu$P, which improves upon $\mu$P by combining it with Unit Scaling, a method for designing models that makes them easy to train in low precision. The two techniques have a natural affinity: $\mu$P ensures that the scale of activations is independent of model size, and Unit Scaling ensures that activations, weights and gradients begin training with a scale of one. This synthesis opens the door to a simpler scheme, whose default values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-$\mu$P models reaching a lower loss than comparable $\mu$P models and working out of the box in FP8.
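As a rough illustration of the Unit Scaling half of this combination, the sketch below (a toy example, not the authors' implementation; the class name `UnitScaledLinear` is hypothetical) shows a linear layer whose weights are initialised with unit variance and whose output is divided by $\sqrt{\text{fan-in}}$, so that both weights and activations begin training with a scale of one regardless of width. The full method also controls gradient scales and, combined with $\mu$P, the per-tensor learning-rate rules, which this toy omits.

```python
import math
import torch

class UnitScaledLinear(torch.nn.Module):
    """Illustrative sketch of a unit-scaled, bias-free linear layer:
    unit-variance weight init, with the usual 1/sqrt(fan_in) factor
    applied to the op's output rather than folded into the weights."""

    def __init__(self, fan_in: int, fan_out: int):
        super().__init__()
        # Weights start at scale 1, independent of width.
        self.weight = torch.nn.Parameter(torch.randn(fan_out, fan_in))
        self.fan_in = fan_in

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Rescaling the output keeps activations at scale ~1 at initialisation.
        return (x @ self.weight.t()) / math.sqrt(self.fan_in)

# Activations keep roughly unit scale regardless of width.
layer = UnitScaledLinear(fan_in=4096, fan_out=4096)
y = layer(torch.randn(8, 4096))
print(y.std())  # ~1.0
```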

Cite

Text

Blake et al. "U-$\mu$P: The Unit-Scaled Maximal Update Parametrization." NeurIPS 2024 Workshops: OPT, 2024.

Markdown

[Blake et al. "U-$\mu$P: The Unit-Scaled Maximal Update Parametrization." NeurIPS 2024 Workshops: OPT, 2024.](https://mlanthology.org/neuripsw/2024/blake2024neuripsw-up/)

BibTeX

@inproceedings{blake2024neuripsw-up,
  title     = {{U-$\mu$P: The Unit-Scaled Maximal Update Parametrization}},
  author    = {Blake, Charlie and Eichenberg, Constantin and Dean, Josef and Balles, Lukas and Prince, Luke Yuri and Deiseroth, Björn and Cruz-Salinas, Andres Felipe and Luschi, Carlo and Weinbach, Samuel and Orr, Douglas},
  booktitle = {NeurIPS 2024 Workshops: OPT},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/blake2024neuripsw-up/}
}