Limits to Depth Efficiencies of Self-Attention

Abstract

Self-attention architectures, which are rapidly pushing the frontier in natural language processing, demonstrate a surprising depth-inefficient behavior: empirical signals indicate that increasing the internal representation (network width) is just as useful as increasing the number of self-attention layers (network depth). In this paper, we theoretically study the interplay between depth and width in self-attention. We shed light on the root of the above phenomenon, and establish two distinct parameter regimes of depth efficiency and inefficiency in self-attention. We invalidate the seemingly plausible hypothesis by which widening is as effective as deepening for self-attention, and show that in fact stacking self-attention layers is so effective that it quickly saturates a capacity of the network width. Specifically, we pinpoint a "depth threshold" that is logarithmic in the network width: for networks of depth that is below the threshold, we establish a double-exponential depth-efficiency of the self-attention operation, while for depths over the threshold we show that depth-inefficiency kicks in. Our predictions accord with existing empirical ablations, and we further demonstrate the two depth-(in)efficiency regimes experimentally for common network depths of 6, 12, and 24. By identifying network width as a limiting factor, our analysis indicates that solutions for dramatically increasing the width can facilitate the next leap in self-attention expressivity.
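
The two regimes described in the abstract can be illustrated with a minimal sketch. The abstract states only that the depth threshold is logarithmic in the width; the logarithm base used below (`base=3.0`), the function names, and the example widths are assumptions made for illustration and are not taken from the abstract.

```python
import math


def depth_threshold(width: int, base: float = 3.0) -> float:
    """Return a depth threshold that is logarithmic in the network width.

    The base of the logarithm is an assumption for illustration; the
    abstract only states that the threshold is logarithmic in the width.
    """
    if width < 1:
        raise ValueError("width must be a positive integer")
    return math.log(width) / math.log(base)


def regime(depth: int, width: int, base: float = 3.0) -> str:
    """Classify a (depth, width) pair into one of the two regimes."""
    if depth < depth_threshold(width, base):
        return "depth-efficient"
    return "depth-inefficient"


if __name__ == "__main__":
    # Hypothetical widths, paired with the depths 6, 12, and 24
    # mentioned in the abstract.
    for depth in (6, 12, 24):
        for width in (512, 4096):
            print(depth, width, regime(depth, width))
```

Under this assumed base, a network must be widened multiplicatively for each additional layer to remain in the depth-efficient regime, which is the sense in which width acts as the limiting factor.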

Cite

Text

Levine et al. "Limits to Depth Efficiencies of Self-Attention." Neural Information Processing Systems, 2020.

Markdown

[Levine et al. "Limits to Depth Efficiencies of Self-Attention." Neural Information Processing Systems, 2020.](https://mlanthology.org/neurips/2020/levine2020neurips-limits/)

BibTeX

@inproceedings{levine2020neurips-limits,
  title     = {{Limits to Depth Efficiencies of Self-Attention}},
  author    = {Levine, Yoav and Wies, Noam and Sharir, Or and Bata, Hofit and Shashua, Amnon},
  booktitle = {Neural Information Processing Systems},
  year      = {2020},
  url       = {https://mlanthology.org/neurips/2020/levine2020neurips-limits/}
}