Beating SGD Saturation with Tail-Averaging and Minibatching

Abstract

While stochastic gradient descent (SGD) is one of the major workhorses in machine learning, the learning properties of many of its practically used variants are still poorly understood. In this paper, we consider least squares learning in a nonparametric setting and contribute to filling this gap by focusing on the effect and interplay of multiple passes, minibatching, and averaging, in particular tail averaging. Our results show how these variants of SGD can be combined to achieve optimal learning rates, while also providing practical insights. A key novel result is that tail averaging allows faster convergence rates than uniform averaging in the nonparametric setting. Further, we show that combining tail averaging and minibatching allows more aggressive step-size choices than using either component on its own.
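
To make the setup concrete, the following is a minimal sketch (not the authors' implementation) of minibatch SGD for least squares with tail averaging, where only the later iterates are averaged into the final estimator. The step size, batch size, and tail fraction below are illustrative assumptions, not values prescribed by the paper.

import numpy as np

def tail_averaged_minibatch_sgd(X, y, passes=10, batch_size=16,
                                step_size=0.1, tail_fraction=0.5, seed=0):
    """Minibatch SGD for least squares; returns the tail average of the iterates.

    Hyperparameters are illustrative assumptions, not the paper's choices.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    iterates = []
    for _ in range(passes):
        # One pass: shuffle the data and sweep over minibatches.
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ w - yb) / len(idx)  # least-squares gradient on the minibatch
            w = w - step_size * grad
            iterates.append(w.copy())
    # Tail averaging: discard the early iterates and average the rest.
    tail_start = int(len(iterates) * (1.0 - tail_fraction))
    return np.mean(iterates[tail_start:], axis=0)

# Usage on synthetic data.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((500, 5))
    w_true = rng.standard_normal(5)
    y = X @ w_true + 0.1 * rng.standard_normal(500)
    w_hat = tail_averaged_minibatch_sgd(X, y)
    print("estimation error:", np.linalg.norm(w_hat - w_true))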

Cite

Text

Muecke et al. "Beating SGD Saturation with Tail-Averaging and Minibatching." Neural Information Processing Systems, 2019.

Markdown

[Muecke et al. "Beating SGD Saturation with Tail-Averaging and Minibatching." Neural Information Processing Systems, 2019.](https://mlanthology.org/neurips/2019/muecke2019neurips-beating/)

BibTeX

@inproceedings{muecke2019neurips-beating,
  title     = {{Beating SGD Saturation with Tail-Averaging and Minibatching}},
  author    = {Muecke, Nicole and Neu, Gergely and Rosasco, Lorenzo},
  booktitle = {Neural Information Processing Systems},
  year      = {2019},
  pages     = {12568--12577},
  url       = {https://mlanthology.org/neurips/2019/muecke2019neurips-beating/}
}