The Effect of Network Width on Stochastic Gradient Descent and Generalization: An Empirical Study
Abstract
We investigate how the final parameters found by stochastic gradient descent are influenced by over-parameterization. We generate families of models by increasing the number of channels in a base network, and then perform a large hyper-parameter search to study how the test error depends on learning rate, batch size, and network width. We find that the optimal SGD hyper-parameters are determined by a "normalized noise scale," which is a function of the batch size, learning rate, and initialization conditions. In the absence of batch normalization, the optimal normalized noise scale is directly proportional to width. Wider networks, with their higher optimal noise scale, also achieve higher test accuracy. These observations hold for MLPs, ConvNets, and ResNets, and for two different parameterization schemes ("Standard" and "NTK"). We observe a similar trend with batch normalization for ResNets. Surprisingly, since the largest stable learning rate is bounded, the largest batch size consistent with the optimal normalized noise scale decreases as the width increases.
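As a rough illustration of the abstract's closing claim, the sketch below uses the common SGD noise-scale heuristic g ≈ εN/B as a stand-in; it does not reproduce the paper's exact "normalized" noise scale, which additionally depends on initialization conditions, and all function names and numbers here are hypothetical. If the optimal noise scale grows linearly with width while the largest stable learning rate is capped, the largest batch size that can still realize that optimum shrinks in proportion to width.

# Minimal sketch, not the paper's implementation: the heuristic
# g ~ learning_rate * dataset_size / batch_size stands in for the paper's
# "normalized" noise scale, which also folds in initialization conditions.

def noise_scale(learning_rate, dataset_size, batch_size):
    """SGD noise-scale heuristic g ~ eps * N / B (simplified stand-in)."""
    return learning_rate * dataset_size / batch_size

def max_batch_size(optimal_noise_scale, max_stable_lr, dataset_size):
    """Largest batch size that still reaches the optimal noise scale
    when the learning rate cannot exceed max_stable_lr."""
    return max_stable_lr * dataset_size / optimal_noise_scale

# Hypothetical numbers: if the optimal noise scale is proportional to width
# and the stable learning rate is bounded, the admissible batch size falls
# as 1/width, matching the abstract's closing observation.
N = 50_000        # e.g. a CIFAR-10-sized training set
eps_max = 0.1     # assumed largest stable learning rate
for width_multiplier in (1, 2, 4, 8):
    g_opt = 2.0 * width_multiplier   # assumed optimum, linear in width
    print(width_multiplier, max_batch_size(g_opt, eps_max, N))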
Cite
Text
Park et al. "The Effect of Network Width on Stochastic Gradient Descent and Generalization: An Empirical Study." International Conference on Machine Learning, 2019.Markdown
[Park et al. "The Effect of Network Width on Stochastic Gradient Descent and Generalization: An Empirical Study." International Conference on Machine Learning, 2019.](https://mlanthology.org/icml/2019/park2019icml-effect/)BibTeX
@inproceedings{park2019icml-effect,
  title = {{The Effect of Network Width on Stochastic Gradient Descent and Generalization: An Empirical Study}},
  author = {Park, Daniel and Sohl-Dickstein, Jascha and Le, Quoc and Smith, Samuel},
  booktitle = {International Conference on Machine Learning},
  year = {2019},
  pages = {5042--5051},
  volume = {97},
  url = {https://mlanthology.org/icml/2019/park2019icml-effect/}
}