Rethinking Gauss-Newton for Learning Over-Parameterized Models

Abstract

This work studies the global convergence and implicit bias of the Gauss-Newton (GN) method when optimizing over-parameterized one-hidden-layer networks in the mean-field regime. We first establish a global convergence result for GN in the continuous-time limit, exhibiting a faster convergence rate than GD due to improved conditioning. We then perform an empirical study on a synthetic regression task to investigate the implicit bias of the GN method. While GN is consistently faster than GD at finding a global optimum, the learned model generalizes well on test data when starting from random initial weights with a small variance and using a small step size to slow down convergence. Specifically, our study shows that such a setting results in a hidden learning phenomenon, where the dynamics recover features with good generalization properties even though the model has sub-optimal training and test performance due to an under-optimized linear layer. This study exhibits a trade-off between the convergence speed of GN and the generalization ability of the learned solution.
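To make the setting concrete, below is a minimal illustrative sketch, not the paper's own code, of a damped Gauss-Newton step for a one-hidden-layer network on a synthetic regression task. The network width `m`, damping `lam`, step size `eta`, initialization scale, and synthetic target are assumptions chosen here for illustration only.

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def net(params, x):
    """One-hidden-layer network with mean-field scaling: f(x) = (1/m) * relu(x W^T) a."""
    w, a = params                     # w: (m, d) hidden weights, a: (m,) output weights
    return jax.nn.relu(x @ w.T) @ a / w.shape[0]

def gauss_newton_step(params, x, y, eta=0.1, lam=1e-3):
    """One damped Gauss-Newton update for the squared loss 0.5 * ||f(params, x) - y||^2."""
    flat, unravel = ravel_pytree(params)          # flatten params into a single vector
    residual = lambda p: net(unravel(p), x) - y
    r = residual(flat)                            # residuals, shape (n,)
    J = jax.jacobian(residual)(flat)              # Jacobian, shape (n, p) with p >> n
    # Damped GN direction in its dual (kernel) form, which is cheaper when over-parameterized:
    # (J^T J + lam I)^{-1} J^T r  ==  J^T (J J^T + lam I)^{-1} r.
    direction = J.T @ jnp.linalg.solve(J @ J.T + lam * jnp.eye(r.shape[0]), r)
    return unravel(flat - eta * direction)

# Toy usage on synthetic regression data.
key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
n, d, m = 32, 5, 200                              # samples, input dim, hidden width
x = jax.random.normal(k1, (n, d))
y = jnp.sin(x[:, 0])                              # synthetic regression target
params = (0.1 * jax.random.normal(k2, (m, d)),    # small-variance initialization
          0.1 * jax.random.normal(k3, (m,)))
for _ in range(100):
    params = gauss_newton_step(params, x, y)
```

The small step size `eta` and small initialization variance mirror the regime the abstract highlights, where slowing down convergence lets the hidden-layer features develop before the linear layer is fully optimized.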

Cite

Text

Arbel et al. "Rethinking Gauss-Newton for Learning Over-Parameterized Models." Neural Information Processing Systems, 2023.

Markdown

[Arbel et al. "Rethinking Gauss-Newton for Learning Over-Parameterized Models." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/arbel2023neurips-rethinking/)

BibTeX

@inproceedings{arbel2023neurips-rethinking,
  title     = {{Rethinking Gauss-Newton for Learning Over-Parameterized Models}},
  author    = {Arbel, Michael and Menegaux, Romain and Wolinski, Pierre},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/arbel2023neurips-rethinking/}
}