Transformers, Parallel Computation, and Logarithmic Depth
Abstract
We show that a constant number of self-attention layers can efficiently simulate—and be simulated by—a constant number of communication rounds of Massively Parallel Computation. As a consequence, we show that logarithmic depth is sufficient for transformers to solve basic computational tasks that cannot be efficiently solved by several other neural sequence models and sub-quadratic transformer approximations. We thus establish parallelism as a key distinguishing property of transformers.
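The role of parallelism can be illustrated with the classic pointer-doubling idea: following k sequential hops through an array requires only about log2(k) parallel rounds, which mirrors why logarithmic depth suffices for such composition tasks in a parallel model. The sketch below is a toy Python illustration under that analogy, not the paper's construction; the function names and the power-of-two assumption on k are ours.

# Toy illustration (not the paper's construction): pointer doubling shows why
# O(log k) parallel rounds can replace k sequential hops.
# The names k_hop_sequential / k_hop_parallel are illustrative only.

def k_hop_sequential(ptr: list[int], start: int, k: int) -> int:
    """Follow the pointer array k times, one hop per step (depth ~ k)."""
    node = start
    for _ in range(k):
        node = ptr[node]
    return node

def k_hop_parallel(ptr: list[int], start: int, k: int) -> int:
    """Pointer doubling: each round squares the hop length, so only
    ceil(log2(k)) rounds are needed (depth ~ log k).
    Assumes k is a power of two for simplicity."""
    jump = list(ptr)          # jump[i] currently points 1 hop ahead of i
    hops = 1
    while hops < k:
        jump = [jump[jump[i]] for i in range(len(jump))]  # double the hop length
        hops *= 2
    return jump[start]

if __name__ == "__main__":
    ptr = [3, 0, 1, 2, 5, 4, 7, 6]   # an arbitrary pointer (permutation) array
    assert k_hop_sequential(ptr, 0, 4) == k_hop_parallel(ptr, 0, 4)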
Cite
Text
Sanford et al. "Transformers, Parallel Computation, and Logarithmic Depth." International Conference on Machine Learning, 2024.
Markdown
[Sanford et al. "Transformers, Parallel Computation, and Logarithmic Depth." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/sanford2024icml-transformers/)
BibTeX
@inproceedings{sanford2024icml-transformers,
  title = {{Transformers, Parallel Computation, and Logarithmic Depth}},
  author = {Sanford, Clayton and Hsu, Daniel and Telgarsky, Matus},
  booktitle = {International Conference on Machine Learning},
  year = {2024},
  pages = {43276--43327},
  volume = {235},
  url = {https://mlanthology.org/icml/2024/sanford2024icml-transformers/}
}