Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning
Abstract
Deep learning is also known as hierarchical learning, where the learner $\textit{learns}$ to represent a complicated target function by decomposing it into a sequence of simpler functions, reducing sample and time complexity. This paper formally analyzes how multi-layer neural networks can perform such hierarchical learning $\textit{efficiently}$ and $\textit{automatically}$ by applying stochastic gradient descent (SGD) or its variants to the training objective. On the conceptual side, we present a theoretical characterization of how certain types of deep (i.e., super-constantly many layers) neural networks can still be trained sample- and time-efficiently on some hierarchical learning tasks for which no known existing algorithm (including layerwise training, kernel methods, etc.) is efficient. We establish a new principle called "backward feature correction", where \emph{the errors in the lower-level features can be automatically corrected when training together with the higher-level layers}. We believe this is a key reason why deep learning performs deep (hierarchical) learning, as opposed to layerwise learning or simulating some known non-hierarchical method.
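To make the contrast concrete, below is a minimal, purely illustrative PyTorch sketch (the network, synthetic data, and hyperparameters are assumptions, not the paper's construction) comparing layerwise training, where a lower block is fitted and then frozen, against end-to-end SGD, where gradients from the final loss keep reaching the lowest layers and can correct their features.

```python
# Illustrative sketch only: contrasts layerwise training (lower block frozen after
# its stage) with end-to-end SGD, where lower-level features keep being updated
# by gradients from the final loss ("backward feature correction" in spirit).
# All shapes, data, and step counts are hypothetical.
import torch
import torch.nn as nn

def make_net():
    return nn.Sequential(
        nn.Linear(32, 64), nn.ReLU(),   # lower-level features
        nn.Linear(64, 64), nn.ReLU(),   # higher-level features
        nn.Linear(64, 1),               # output layer
    )

x = torch.randn(512, 32)
y = torch.randn(512, 1)                 # stand-in for a hierarchical target
loss_fn = nn.MSELoss()

# Layerwise-style baseline: fit the lower block through a temporary auxiliary
# head, then freeze it before training the upper layers.
layerwise = make_net()
aux_head = nn.Linear(64, 1)             # hypothetical auxiliary head for the lower block
opt = torch.optim.SGD(list(layerwise[:2].parameters()) + list(aux_head.parameters()), lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss_fn(aux_head(layerwise[:2](x)), y).backward()
    opt.step()
for p in layerwise[:2].parameters():
    p.requires_grad = False             # lower-level feature errors are now locked in
opt = torch.optim.SGD([p for p in layerwise.parameters() if p.requires_grad], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss_fn(layerwise(x), y).backward()
    opt.step()

# End-to-end SGD: every layer, including the lowest, keeps receiving gradients
# from the final loss, so lower-level features can still be corrected while the
# higher-level layers are being learned.
joint = make_net()
opt = torch.optim.SGD(joint.parameters(), lr=0.1)
for _ in range(400):
    opt.zero_grad()
    loss_fn(joint(x), y).backward()
    opt.step()
```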
Cite
Text
Allen-Zhu and Li. "Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning." Conference on Learning Theory, 2023.
Markdown
[Allen-Zhu and Li. "Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning." Conference on Learning Theory, 2023.](https://mlanthology.org/colt/2023/allenzhu2023colt-backward/)
BibTeX
@inproceedings{allenzhu2023colt-backward,
  title     = {{Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning}},
  author    = {Allen-Zhu, Zeyuan and Li, Yuanzhi},
  booktitle = {Conference on Learning Theory},
  year      = {2023},
  pages     = {4598--4598},
  volume    = {195},
  url       = {https://mlanthology.org/colt/2023/allenzhu2023colt-backward/}
}