Transformers, Parallel Computation, and Logarithmic Depth

Abstract

We show that a constant number of self-attention layers can efficiently simulate—and be simulated by—a constant number of communication rounds of Massively Parallel Computation. As a consequence, we show that logarithmic depth is sufficient for transformers to solve basic computational tasks that cannot be efficiently solved by several other neural sequence models and sub-quadratic transformer approximations. We thus establish parallelism as a key distinguishing property of transformers.
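
The intuition behind "logarithmic depth suffices" can be illustrated with pointer doubling: if each of a logarithmic number of rounds lets every position look up one other position (roughly what a single attention layer can do), then composing k pointer hops takes only O(log k) rounds. The sketch below is a hypothetical illustration written for this page, not code from the paper; the function name `k_hop_parallel` and the example inputs are assumptions introduced here.

```python
def k_hop_parallel(pointers, k):
    """Follow a pointer chain k steps for every index simultaneously,
    via pointer doubling: O(log k) rounds, each round a single gather
    (analogous to one attention layer retrieving one position per query).
    Illustrative sketch only; not the paper's construction."""
    hop = list(range(len(pointers)))    # identity: 0 hops applied so far
    jump = list(pointers)               # jump[i] = f(i), the 1-hop pointer
    while k > 0:
        if k & 1:                       # apply the current power of f
            hop = [jump[h] for h in hop]
        jump = [jump[j] for j in jump]  # square: f^(2t) = f^t o f^t
        k >>= 1
    return hop

# Example: successor pointers on a cycle of length 8, composed 5 times.
print(k_hop_parallel([(i + 1) % 8 for i in range(8)], 5))
# -> [5, 6, 7, 0, 1, 2, 3, 4], computed in 3 rounds rather than 5
```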

Cite

Text

Sanford et al. "Transformers, Parallel Computation, and Logarithmic Depth." International Conference on Machine Learning, 2024.

Markdown

[Sanford et al. "Transformers, Parallel Computation, and Logarithmic Depth." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/sanford2024icml-transformers/)

BibTeX

@inproceedings{sanford2024icml-transformers,
  title     = {{Transformers, Parallel Computation, and Logarithmic Depth}},
  author    = {Sanford, Clayton and Hsu, Daniel and Telgarsky, Matus},
  booktitle = {International Conference on Machine Learning},
  year      = {2024},
  pages     = {43276--43327},
  volume    = {235},
  url       = {https://mlanthology.org/icml/2024/sanford2024icml-transformers/}
}