Learning Over-Parametrized Two-Layer Neural Networks Beyond NTK

Abstract

We consider the dynamics of gradient descent for learning a two-layer neural network. We assume the input $x\in\mathbb{R}^d$ is drawn from a Gaussian distribution and the label of $x$ satisfies $f^{\star}(x) = a^{\top}|W^{\star}x|$, where $a\in\mathbb{R}^d$ is a nonnegative vector and $W^{\star} \in\mathbb{R}^{d\times d}$ is an orthonormal matrix. We show that an \emph{over-parameterized} two-layer neural network with ReLU activation, trained by gradient descent from \emph{random initialization}, can provably learn the ground truth network with population loss at most $o(1/d)$ in polynomial time with polynomial samples. On the other hand, we prove that any kernel method, including the Neural Tangent Kernel, with a polynomial number of samples in $d$, has population loss at least $\Omega(1 / d)$.
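For intuition, here is a minimal NumPy sketch of the setup described in the abstract: Gaussian inputs, labels from the ground truth $f^{\star}(x) = a^{\top}|W^{\star}x|$ with nonnegative $a$ and orthonormal $W^{\star}$, and an over-parameterized two-layer ReLU network trained by plain gradient descent from random initialization. All dimensions, step sizes, and initialization scales below are illustrative assumptions, not values or guarantees from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem sizes (not taken from the paper).
d = 20          # input dimension
m = 200         # student width (over-parameterized: m >> d)
n = 5000        # number of training samples
lr = 0.05       # gradient-descent step size
steps = 2000    # number of iterations

# Ground truth f*(x) = a^T |W* x|, with a >= 0 and W* orthonormal.
a_star = np.abs(rng.normal(size=d)) / d            # nonnegative second layer
W_star, _ = np.linalg.qr(rng.normal(size=(d, d)))  # orthonormal first layer

# Gaussian inputs and noiseless labels.
X = rng.normal(size=(n, d))
y = np.abs(X @ W_star.T) @ a_star

# Randomly initialized over-parameterized student: f(x) = b^T ReLU(W x).
W = rng.normal(size=(m, d)) / np.sqrt(d)
b = rng.normal(size=m) / np.sqrt(m)

def forward(X, W, b):
    H = np.maximum(X @ W.T, 0.0)   # hidden ReLU activations, shape (n, m)
    return H, H @ b                # predictions, shape (n,)

for t in range(steps):
    H, pred = forward(X, W, b)
    err = pred - y                                  # residuals
    grad_b = H.T @ err / n                          # gradient w.r.t. b
    mask = (H > 0).astype(X.dtype)                  # ReLU gates
    grad_W = ((err[:, None] * mask) * b).T @ X / n  # gradient w.r.t. W
    b -= lr * grad_b
    W -= lr * grad_W
    if t % 500 == 0:
        print(f"step {t}: train MSE = {np.mean(err**2):.5f}")

# Estimate the population loss on fresh Gaussian samples.
X_test = rng.normal(size=(n, d))
y_test = np.abs(X_test @ W_star.T) @ a_star
_, pred_test = forward(X_test, W, b)
print("test MSE:", np.mean((pred_test - y_test) ** 2))
```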

Cite

Text

Li et al. "Learning Over-Parametrized Two-Layer Neural Networks Beyond NTK." Conference on Learning Theory, 2020.

Markdown

[Li et al. "Learning Over-Parametrized Two-Layer Neural Networks Beyond NTK." Conference on Learning Theory, 2020.](https://mlanthology.org/colt/2020/li2020colt-learning/)

BibTeX

@inproceedings{li2020colt-learning,
  title     = {{Learning Over-Parametrized Two-Layer Neural Networks Beyond NTK}},
  author    = {Li, Yuanzhi and Ma, Tengyu and Zhang, Hongyang R.},
  booktitle = {Conference on Learning Theory},
  year      = {2020},
  pages     = {2613--2682},
  volume    = {125},
  url       = {https://mlanthology.org/colt/2020/li2020colt-learning/}
}