Tensor Programs IIb: Architectural Universality of Neural Tangent Kernel Training Dynamics
Abstract
Yang (2020) recently showed that the Neural Tangent Kernel (NTK) at initialization has an infinite-width limit for a large class of architectures, including modern staples such as ResNet and Transformers. However, their analysis does not apply to training. Here, we show that, during training, the same neural networks (in the so-called NTK parametrization) follow a kernel gradient descent dynamics in function space, where the kernel is the infinite-width NTK. This completes the proof of the architectural universality of NTK behavior. To achieve this result, we apply the Tensor Programs technique: write the entire SGD dynamics inside a Tensor Program and analyze it via the Master Theorem. To facilitate this proof, we develop a graphical notation for Tensor Programs, which we believe is also an important contribution toward the pedagogy and exposition of the Tensor Programs technique.
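For readers unfamiliar with the phrase, "kernel gradient descent dynamics in function space" can be illustrated by the following standard equation. This is a minimal sketch in our own notation, not taken from the paper: it assumes a scalar network output $f_t$, a training set $\{(x_i, y_i)\}_{i=1}^{n}$, a per-example loss $\ell$, a learning rate $\eta$, and continuous-time (gradient-flow) training. Under these assumptions, the network output on any input $x$ evolves as

$$
\frac{\partial f_t(x)}{\partial t} \;=\; -\,\eta \sum_{i=1}^{n} K^{\infty}(x, x_i)\,\frac{\partial \ell\big(f_t(x_i), y_i\big)}{\partial f_t(x_i)},
$$

where $K^{\infty}$ is the deterministic infinite-width NTK. The paper's contribution is showing that, for the broad class of architectures expressible as Tensor Programs, the finite-width SGD training trajectory converges to this kernel dynamics as the width goes to infinity.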
Cite
Text
Yang and Littwin. "Tensor Programs IIb: Architectural Universality of Neural Tangent Kernel Training Dynamics." International Conference on Machine Learning, 2021.
Markdown
[Yang and Littwin. "Tensor Programs IIb: Architectural Universality of Neural Tangent Kernel Training Dynamics." International Conference on Machine Learning, 2021.](https://mlanthology.org/icml/2021/yang2021icml-tensor/)
BibTeX
@inproceedings{yang2021icml-tensor,
title = {{Tensor Programs IIb: Architectural Universality of Neural Tangent Kernel Training Dynamics}},
author = {Yang, Greg and Littwin, Etai},
booktitle = {International Conference on Machine Learning},
year = {2021},
pages = {11762--11772},
volume = {139},
url = {https://mlanthology.org/icml/2021/yang2021icml-tensor/}
}