On the Convergence of Encoder-Only Shallow Transformers

Abstract

In this paper, we aim to build a global convergence theory for encoder-only shallow Transformers under a realistic setting in terms of architecture, initialization, and scaling in the finite-width regime. The difficulty lies in handling the softmax in the self-attention mechanism, the core ingredient of the Transformer. In particular, we diagnose the scaling scheme, carefully handle the input/output of the softmax, and prove that quadratic overparameterization is sufficient for the global convergence of our shallow Transformers under the He/LeCun initialization commonly used in practice. In addition, a neural tangent kernel (NTK) based analysis is also given, which facilitates a comprehensive comparison. Our theory demonstrates a separation in the importance of different scaling schemes and initializations. We believe our results can pave the way for a better understanding of modern Transformers, particularly their training dynamics.
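For readers unfamiliar with the setting, the sketch below is a minimal NumPy illustration of an encoder-only shallow Transformer: a single softmax self-attention layer followed by a ReLU feedforward head, with LeCun-style initialization (Gaussian weights of variance 1/fan_in) and an explicit softmax temperature and output scaling. The function names, dimensions, and the particular scaling factors here are illustrative assumptions, not the authors' exact parameterization.

# Minimal sketch (illustrative only): an encoder-only shallow Transformer
# with softmax self-attention, LeCun-style initialization, and explicit
# scaling factors. The paper's precise parameterization may differ.
import numpy as np

def lecun_init(fan_in, fan_out, rng):
    """LeCun initialization: i.i.d. Gaussian entries with variance 1/fan_in."""
    return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(d_model, m, rng):
    """d_model: token dimension; m: hidden width of the shallow encoder."""
    return {
        "W_Q": lecun_init(d_model, m, rng),  # query projection
        "W_K": lecun_init(d_model, m, rng),  # key projection
        "W_V": lecun_init(d_model, m, rng),  # value projection
        "W_H": lecun_init(m, m, rng),        # feedforward (hidden) layer
        "w_O": lecun_init(m, 1, rng),        # scalar output head
    }

def shallow_transformer(X, params, tau=None):
    """Map an input sequence X of shape (n_tokens, d_model) to a scalar output.

    tau is the softmax temperature/scaling; 1/sqrt(d_k) is the usual choice,
    and the effect of different scalings is one of the questions the paper studies.
    """
    Q = X @ params["W_Q"]
    K = X @ params["W_K"]
    V = X @ params["W_V"]
    d_k = Q.shape[-1]
    tau = tau if tau is not None else 1.0 / np.sqrt(d_k)
    A = softmax(tau * (Q @ K.T), axis=-1)        # softmax self-attention weights
    H = np.maximum(A @ V @ params["W_H"], 0.0)   # ReLU feedforward layer
    out = (H.mean(axis=0) @ params["w_O"]).item()
    # Width-dependent output scaling (an assumed choice for this sketch).
    return out / np.sqrt(H.shape[-1])

rng = np.random.default_rng(0)
params = init_params(d_model=16, m=64, rng=rng)
X = rng.normal(size=(10, 16))                    # a sequence of 10 tokens
print(shallow_transformer(X, params))

In this kind of sketch, the softmax temperature and the width-dependent output scaling are exactly the knobs whose choices the paper's convergence analysis separates; the NTK-based comparison corresponds to linearizing such a network around its (He/LeCun) initialization.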

Cite

Text

Wu et al. "On the Convergence of Encoder-Only Shallow Transformers." Neural Information Processing Systems, 2023.

Markdown

[Wu et al. "On the Convergence of Encoder-Only Shallow Transformers." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/wu2023neurips-convergence/)

BibTeX

@inproceedings{wu2023neurips-convergence,
  title     = {{On the Convergence of Encoder-Only Shallow Transformers}},
  author    = {Wu, Yongtao and Liu, Fanghui and Chrysos, Grigorios and Cevher, Volkan},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/wu2023neurips-convergence/}
}