Global Convergence of Gradient Descent for Deep Linear Residual Networks

Abstract

We analyze the global convergence of gradient descent for deep linear residual networks by proposing a new initialization: zero-asymmetric (ZAS) initialization. The initialization is motivated by avoiding the stable manifolds of saddle points. We prove that under the ZAS initialization, for an arbitrary target matrix, gradient descent converges to an $\varepsilon$-optimal point in $O\left( L^3 \log(1/\varepsilon) \right)$ iterations, which scales polynomially with the network depth $L$. Our result, together with the $\exp(\Omega(L))$ convergence time for standard initializations (Xavier or near-identity) (Shamir, 2018), demonstrates the importance of the residual structure and of the initialization when optimizing deep linear neural networks, especially for large $L$.
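Since the abstract only names the ZAS initialization, the snippet below is a minimal, hypothetical sketch of the kind of setup it describes: a deep linear residual network trained by plain gradient descent, with the residual blocks started at the identity ($W_l = 0$) and a final non-residual layer started at zero. The model form $W_{\mathrm{out}}(I + W_L)\cdots(I + W_1)x$, the random target, the dimensions, and the step size are illustrative assumptions rather than the paper's exact construction or constants; PyTorch is used only for automatic differentiation.

```python
# Hypothetical sketch (not the authors' code): gradient descent on a deep
# linear residual network with a ZAS-style initialization. Residual blocks
# (I + W_l) start at the identity (W_l = 0); the final non-residual layer
# W_out starts at zero. All constants below are illustrative assumptions.
import torch

torch.manual_seed(0)
d, L, n = 8, 8, 256                        # width, depth, number of samples
X = torch.randn(d, n)
Phi = torch.randn(d, d) / d ** 0.5         # arbitrary (here random) target matrix
Y = Phi @ X

W_res = [torch.zeros(d, d, requires_grad=True) for _ in range(L)]
W_out = torch.zeros(d, d, requires_grad=True)
params = W_res + [W_out]

def forward(x):
    for W in W_res:
        x = x + W @ x                      # residual block: x <- (I + W_l) x
    return W_out @ x                       # non-residual output layer

lr = 2e-3                                  # small, conservative step size
for step in range(5001):
    loss = 0.5 * ((forward(X) - Y) ** 2).sum() / n
    for p in params:
        p.grad = None                      # clear gradients from the last step
    loss.backward()
    with torch.no_grad():
        for p in params:
            p -= lr * p.grad               # plain (full-batch) gradient descent
    if step % 1000 == 0:
        print(f"step {step:5d}  loss {loss.item():.3e}")
```

At initialization the network output is zero, so the early updates move only the output layer (the residual blocks receive zero gradient), which is the asymmetry the initialization's name refers to; the training loss should then decrease geometrically toward zero for this least-squares objective.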

Cite

Text

Wu et al. "Global Convergence of Gradient Descent for Deep Linear Residual Networks." Neural Information Processing Systems, 2019.

Markdown

[Wu et al. "Global Convergence of Gradient Descent for Deep Linear Residual Networks." Neural Information Processing Systems, 2019.](https://mlanthology.org/neurips/2019/wu2019neurips-global/)

BibTeX

@inproceedings{wu2019neurips-global,
  title     = {{Global Convergence of Gradient Descent for Deep Linear Residual Networks}},
  author    = {Wu, Lei and Wang, Qingcan and Ma, Chao},
  booktitle = {Neural Information Processing Systems},
  year      = {2019},
  pages     = {13389--13398},
  url       = {https://mlanthology.org/neurips/2019/wu2019neurips-global/}
}