Sub-Linear Memory: How to Make Performers SLiM
Abstract
Transformer architectures have become very popular, yet the original implementation requires $O(L^2)$ serial time and memory as functions of the input length $L$. Recent works proposed various linear self-attention mechanisms, scaling only as $O(L)$ for serial computation. We conduct a thorough complexity analysis of Performers, a class which includes most recent linear Transformer mechanisms. We note a remarkable computational flexibility: the gradient computation can be performed with no approximations using sublinear memory as a function of $L$ (in addition to negligible storage for the input sequence), at a cost of greater time complexity in the parallel setting. In the extreme case, a Performer consumes only $O(1)$ memory, and still requires $O(L)$ time. Due to complete backward-compatibility, this discovered time-memory tradeoff can be used for fine-tuning on low-memory devices in a decentralized fashion without any server computations.
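For intuition, below is a minimal sketch (not the authors' implementation) of causal Performer-style linear attention computed as a running prefix sum over the sequence. The per-step state has a size independent of $L$, which is the structural property that enables the sub-linear-memory gradient computation described in the abstract. The feature-mapped queries and keys `Q`, `K` are assumed to be precomputed (e.g., from positive random features); the function name and shapes are illustrative.

```python
import numpy as np

def causal_linear_attention(Q, K, V):
    """Causal linear attention via running prefix sums (illustrative sketch).

    Q, K: (L, M) non-negative feature maps; V: (L, d) values.
    The running state (S, z) has size M*d + M, independent of L, so the
    forward pass needs only O(1) memory in L beyond the inputs/outputs.
    """
    L, M = Q.shape
    d = V.shape[1]
    S = np.zeros((M, d))   # running sum of outer products k_i v_i^T
    z = np.zeros(M)        # running sum of k_i (normalizer)
    out = np.empty((L, d))
    for i in range(L):
        S += np.outer(K[i], V[i])
        z += K[i]
        out[i] = (Q[i] @ S) / (Q[i] @ z + 1e-6)
    return out

# Example usage with random positive features (hypothetical shapes):
# L, M, d = 1024, 64, 32
# Q, K = np.abs(np.random.randn(L, M)), np.abs(np.random.randn(L, M))
# V = np.random.randn(L, d)
# y = causal_linear_attention(Q, K, V)
```

Because the state `(S, z)` is small and updated sequentially, the backward pass can recompute intermediate states from checkpoints instead of storing all $L$ of them, trading parallel time for memory, which is the tradeoff the paper analyzes.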
Cite
Text
Likhosherstov et al. "Sub-Linear Memory: How to Make Performers SLiM." Neural Information Processing Systems, 2021.
Markdown
[Likhosherstov et al. "Sub-Linear Memory: How to Make Performers SLiM." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/likhosherstov2021neurips-sublinear/)
BibTeX
@inproceedings{likhosherstov2021neurips-sublinear,
title = {{Sub-Linear Memory: How to Make Performers SLiM}},
author = {Likhosherstov, Valerii and Choromanski, Krzysztof M and Davis, Jared Quincy and Song, Xingyou and Weller, Adrian},
booktitle = {Neural Information Processing Systems},
year = {2021},
url = {https://mlanthology.org/neurips/2021/likhosherstov2021neurips-sublinear/}
}