Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit
Abstract
We study residual networks whose residual branches are scaled by $1/\sqrt{\text{depth}}$ in combination with the $\mu$P parameterization. We provide experiments demonstrating that residual architectures, including convolutional ResNets and Vision Transformers, trained with this parameterization exhibit transfer of optimal hyperparameters across width and depth on CIFAR-10 and ImageNet. Furthermore, using recent developments in the dynamical mean field theory (DMFT) description of neural network learning dynamics, we show that this parameterization of ResNets admits a well-defined feature-learning joint infinite-width and infinite-depth limit, and we demonstrate convergence of finite-size network dynamics towards this limit.
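The core of the parameterization the abstract describes is a residual update in which each branch is scaled by $1/\sqrt{\text{depth}}$. Below is a minimal sketch, not the authors' code: a plain residual MLP with this branch scaling. The class name `ScaledResidualMLP` and the MLP branch are illustrative assumptions; the paper's experiments use convolutional ResNets and Vision Transformers, and the full $\mu$P recipe additionally rescales initializations and per-layer learning rates with width, which this sketch omits.

```python
import math

import torch
import torch.nn as nn


class ScaledResidualMLP(nn.Module):
    """Illustrative residual MLP with 1/sqrt(depth)-scaled branches."""

    def __init__(self, width: int, depth: int):
        super().__init__()
        self.depth = depth
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Linear(width, width), nn.ReLU()) for _ in range(depth)]
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h^{l+1} = h^l + (1/sqrt(L)) * F_l(h^l): the branch scale that,
        # combined with muP width scaling (omitted here), gives the
        # depthwise hyperparameter transfer studied in the paper.
        for branch in self.branches:
            h = h + branch(h) / math.sqrt(self.depth)
        return h


if __name__ == "__main__":
    net = ScaledResidualMLP(width=256, depth=64)
    print(net(torch.randn(8, 256)).shape)  # torch.Size([8, 256])
```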
Cite

Text

Bordelon et al. "Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit." NeurIPS 2023 Workshops: M3L, 2023.

Markdown

[Bordelon et al. "Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit." NeurIPS 2023 Workshops: M3L, 2023.](https://mlanthology.org/neuripsw/2023/bordelon2023neuripsw-depthwise/)

BibTeX
@inproceedings{bordelon2023neuripsw-depthwise,
  title     = {{Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit}},
  author    = {Bordelon, Blake and Noci, Lorenzo and Li, Mufan and Hanin, Boris and Pehlevan, Cengiz},
  booktitle = {NeurIPS 2023 Workshops: M3L},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/bordelon2023neuripsw-depthwise/}
}