The Effect of Network Width on the Performance of Large-Batch Training
Abstract
Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training. Training with large batches can reduce these overheads; however, it can impede the convergence of the algorithm and its generalization performance. In this work, we take a first step towards analyzing how the structure (width and depth) of a neural network affects the performance of large-batch training. We present new theoretical results which suggest that, for a fixed number of parameters, wider networks are more amenable to fast large-batch training than deeper ones. We provide extensive experiments on residual and fully-connected neural networks which suggest that wider networks can be trained with larger batches without incurring a convergence slowdown, unlike their deeper variants.
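To make the comparison described in the abstract concrete, below is a minimal sketch, assuming PyTorch, synthetic data, and illustrative layer sizes and hyperparameters (none taken from the paper): two fully-connected networks with roughly the same parameter count, one wide and shallow and one narrow and deep, each trained with plain mini-batch SGD at a large batch size.

```python
# Illustrative sketch (not the authors' code): compare a wide-shallow and a
# narrow-deep MLP with comparable parameter budgets under large-batch SGD.
# Layer sizes, data, and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn

def fully_connected(in_dim, hidden, depth, out_dim):
    """Build an MLP with `depth` hidden layers of width `hidden`."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

def num_params(model):
    return sum(p.numel() for p in model.parameters())

# Roughly matched parameter counts (~670k each): wide vs. deep.
wide_net = fully_connected(784, hidden=512, depth=2, out_dim=10)
deep_net = fully_connected(784, hidden=256, depth=8, out_dim=10)
print("params:", num_params(wide_net), num_params(deep_net))

def train(model, batch_size, steps=100, lr=0.1):
    """Plain mini-batch SGD on synthetic data (stand-in for a real dataset)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    loss = torch.tensor(0.0)
    for _ in range(steps):
        x = torch.randn(batch_size, 784)          # synthetic inputs
        y = torch.randint(0, 10, (batch_size,))   # synthetic labels
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

# Large-batch regime: few updates, each on a large batch. The abstract's claim
# is that the wide network tolerates this regime better than the deep one.
for name, net in [("wide", wide_net), ("deep", deep_net)]:
    print(name, train(net, batch_size=4096))
```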
Cite
Text
Chen et al. "The Effect of Network Width on the Performance of Large-Batch Training." Neural Information Processing Systems, 2018.
Markdown
[Chen et al. "The Effect of Network Width on the Performance of Large-Batch Training." Neural Information Processing Systems, 2018.](https://mlanthology.org/neurips/2018/chen2018neurips-effect/)
BibTeX
@inproceedings{chen2018neurips-effect,
  title     = {{The Effect of Network Width on the Performance of Large-Batch Training}},
  author    = {Chen, Lingjiao and Wang, Hongyi and Zhao, Jinman and Papailiopoulos, Dimitris and Koutris, Paraschos},
  booktitle = {Neural Information Processing Systems},
  year      = {2018},
  pages     = {9302--9309},
  url       = {https://mlanthology.org/neurips/2018/chen2018neurips-effect/}
}