The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit

Abstract

In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network's trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite depth-and-width. We show that at initialization the limiting distribution can be described by a stochastic differential equation (SDE) indexed by the depth-to-width ratio. To achieve a well-defined stochastic limit, the Transformer's attention mechanism is modified by centering the Softmax output at identity, and scaling the Softmax logits by a width-dependent temperature parameter. We examine the stability of the network through the corresponding SDE, showing how the scale of both the drift and diffusion can be elegantly controlled with the aid of residual connections. The existence of a stable SDE implies that the covariance structure is well-behaved, even for very large depth and width, thus preventing the notorious issues of rank degeneracy in deep attention models. Finally, we show, through simulations, that the SDE provides a surprisingly good description of the corresponding finite-size model. We coin the name shaped Transformer for these architectural modifications.
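The two architectural modifications named in the abstract, centering the Softmax output at identity and scaling the logits by a width-dependent temperature, can be illustrated with a minimal sketch. This is not the paper's exact parameterization: the temperature schedule (`tau0 * sqrt(width_n * d)`) and the centering term (subtracting the row-uniform matrix) are illustrative assumptions, chosen so the attention matrix becomes a small perturbation of the identity.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable row-wise softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shaped_attention(Q, K, V, width_n, tau0=1.0):
    """Sketch of attention with the abstract's two modifications.

    Assumptions (not the paper's exact formulas):
    - width-dependent temperature: logits shrink as width_n grows,
    - centering: subtract the uniform matrix so that, when the logits
      vanish, the attention matrix reduces to the identity.
    """
    T, d = Q.shape
    temperature = tau0 * np.sqrt(width_n * d)  # hypothetical schedule
    S = softmax(Q @ K.T / temperature, axis=-1)
    # center the Softmax output at identity
    A = np.eye(T) + (S - np.full((T, T), 1.0 / T))
    return A @ V
```

With zero logits the Softmax is uniform, the centered term vanishes, and the layer acts as the identity on `V`, which is the behavior the centering is designed to produce at initialization.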

Cite

Text

Noci et al. "The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit." Neural Information Processing Systems, 2023.

Markdown

[Noci et al. "The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/noci2023neurips-shaped/)

BibTeX

@inproceedings{noci2023neurips-shaped,
  title     = {{The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit}},
  author    = {Noci, Lorenzo and Li, Chuning and Li, Mufan and He, Bobby and Hofmann, Thomas and Maddison, Chris J and Roy, Dan},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/noci2023neurips-shaped/}
}