Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron
Abstract
We revisit the canonical problem of learning a single neuron with ReLU activation under Gaussian input with square loss. We particularly focus on the over-parameterization setting where the student network has $n\ge 2$ neurons. We prove the global convergence of randomly initialized gradient descent with an $O\left(T^{-3}\right)$ rate. This is the first global convergence result for this problem beyond the exact-parameterization setting ($n=1$), in which gradient descent enjoys an $\exp(-\Omega(T))$ rate. Perhaps surprisingly, we further present an $\Omega\left(T^{-3}\right)$ lower bound for randomly initialized gradient flow in the over-parameterization setting. These two bounds jointly give an exact characterization of the convergence rate and imply, for the first time, that \emph{over-parameterization can exponentially slow down the convergence rate}. To prove the global convergence, we need to tackle the interactions among student neurons in the gradient descent dynamics, which are not present in the exact-parameterization case. We use a three-phase structure to analyze GD's dynamics. Along the way, we prove that gradient descent automatically balances the student neurons, and we use this property to deal with the non-smoothness of the objective function. To prove the convergence rate lower bound, we construct a novel potential function that characterizes the pairwise distances between the student neurons (which cannot be done in the exact-parameterization case). We show this potential function converges slowly, which implies the slow convergence rate of the loss function.
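To make the teacher-student setting concrete, the sketch below runs full-batch gradient descent on the empirical square loss for a student network $f(x)=\sum_{i=1}^n \mathrm{ReLU}(w_i^\top x)$ fitting a single teacher ReLU neuron under Gaussian inputs, once with $n=1$ (exact parameterization) and once with $n=4$ (over-parameterized). This is only an illustrative simulation of the problem setup, not the paper's analysis; the input dimension, sample size, step size, and initialization scale are arbitrary choices made here for demonstration.

import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def run_gd(n_student, d=5, n_samples=5000, steps=1000, lr=0.05):
    """Full-batch GD on the empirical square loss; returns the loss trajectory."""
    X = rng.standard_normal((n_samples, d))        # Gaussian inputs
    w_star = np.zeros(d); w_star[0] = 1.0          # unit-norm teacher neuron
    y = relu(X @ w_star)                           # teacher labels relu(w* . x)
    W = 0.1 * rng.standard_normal((n_student, d))  # small random initialization
    losses = []
    for _ in range(steps):
        pre = X @ W.T                              # pre-activations, shape (n_samples, n_student)
        pred = relu(pre).sum(axis=1)               # student output sum_i relu(w_i . x)
        err = pred - y
        losses.append(0.5 * np.mean(err ** 2))
        # gradient of the loss w.r.t. each student weight vector w_i
        grad = ((err[:, None] * (pre > 0)).T @ X) / n_samples
        W -= lr * grad
    return np.array(losses)

exact = run_gd(n_student=1)   # exact parameterization (n = 1)
over = run_gd(n_student=4)    # over-parameterized student (n >= 2)
print("final loss, n=1:", exact[-1])
print("final loss, n=4:", over[-1])

Comparing the two loss trajectories gives an empirical feel for the fast (exponential-rate) decay in the exact-parameterization case versus the slower polynomial-rate decay that the paper proves for the over-parameterized case.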
Cite

Text

Xu and Du. "Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron." Conference on Learning Theory, 2023.

Markdown

[Xu and Du. "Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron." Conference on Learning Theory, 2023.](https://mlanthology.org/colt/2023/xu2023colt-overparameterization/)

BibTeX
@inproceedings{xu2023colt-overparameterization,
title = {{Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron}},
author = {Xu, Weihang and Du, Simon},
booktitle = {Conference on Learning Theory},
year = {2023},
pages = {1155-1198},
volume = {195},
url = {https://mlanthology.org/colt/2023/xu2023colt-overparameterization/}
}