Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron

Abstract

We revisit the canonical problem of learning a single neuron with ReLU activation under Gaussian input with square loss. We particularly focus on the over-parameterization setting where the student network has $n\ge 2$ neurons. We prove the global convergence of randomly initialized gradient descent with a $O\left(T^{-3}\right)$ rate. This is the first global convergence result for this problem beyond the exact-parameterization setting ($n=1$), in which gradient descent enjoys an $\exp(-\Omega(T))$ rate. Perhaps surprisingly, we further present an $\Omega\left(T^{-3}\right)$ lower bound for randomly initialized gradient flow in the over-parameterization setting. These two bounds jointly give an exact characterization of the convergence rate and imply, for the first time, that \emph{over-parameterization can exponentially slow down the convergence rate}. To prove the global convergence, we need to tackle the interactions among student neurons in the gradient descent dynamics, which are not present in the exact-parameterization case. We use a three-phase structure to analyze GD’s dynamics. Along the way, we prove that gradient descent automatically balances student neurons, and we use this property to deal with the non-smoothness of the objective function. To prove the convergence rate lower bound, we construct a novel potential function that characterizes the pairwise distances between the student neurons (which cannot be done in the exact-parameterization case). We show this potential function converges slowly, which implies the slow convergence rate of the loss function.
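For intuition about the setting, here is a minimal simulation sketch (not the paper's analysis): a single teacher ReLU neuron $v$, an over-parameterized student with $n\ge 2$ ReLU neurons, square loss, and full-batch gradient descent. The paper studies the population loss under Gaussian input; this sketch approximates it with a large fixed Gaussian sample, and all constants (`d`, `n`, `N`, `eta`, `T`) are illustrative choices rather than values from the paper.

```python
import numpy as np

# Illustrative sketch only: approximate the population square loss
#   L(W) = 0.5 * E_{x ~ N(0, I_d)} [ ( sum_i relu(w_i^T x) - relu(v^T x) )^2 ]
# with a large fixed sample and run full-batch gradient descent.

rng = np.random.default_rng(0)
d, n, N, eta, T = 10, 4, 50_000, 0.05, 2_000   # illustrative hyperparameters

v = rng.standard_normal(d)
v /= np.linalg.norm(v)                         # teacher neuron (unit norm)
W = 0.1 * rng.standard_normal((n, d))          # small random initialization
X = rng.standard_normal((N, d))                # samples approximating N(0, I_d)

relu = lambda z: np.maximum(z, 0.0)
y = relu(X @ v)                                # teacher outputs

for t in range(T):
    pre = X @ W.T                              # (N, n) student pre-activations
    residual = relu(pre).sum(axis=1) - y       # student output minus teacher
    # Gradient of 0.5 * mean(residual^2) w.r.t. student neuron w_i:
    #   (1/N) * sum_j residual_j * 1{w_i^T x_j > 0} * x_j
    grad = ((residual[:, None] * (pre > 0)).T @ X) / N
    W -= eta * grad
    if t % 500 == 0:
        print(f"step {t:5d}  loss {0.5 * np.mean(residual**2):.6f}")
```

In such a simulation one can contrast $n=1$ (where the loss typically decays geometrically) with $n\ge 2$ (where the decay is markedly slower), which is the qualitative phenomenon the paper quantifies with the matching $O(T^{-3})$ and $\Omega(T^{-3})$ bounds.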

Cite

Text

Xu and Du. "Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron." Conference on Learning Theory, 2023.

Markdown

[Xu and Du. "Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron." Conference on Learning Theory, 2023.](https://mlanthology.org/colt/2023/xu2023colt-overparameterization/)

BibTeX

@inproceedings{xu2023colt-overparameterization,
  title     = {{Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron}},
  author    = {Xu, Weihang and Du, Simon},
  booktitle = {Conference on Learning Theory},
  year      = {2023},
  pages     = {1155--1198},
  volume    = {195},
  url       = {https://mlanthology.org/colt/2023/xu2023colt-overparameterization/}
}