Reformer: The Efficient Transformer

Abstract

Large Transformer models routinely achieve state-of-the-art results on a number of tasks, but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. First, we replace dot-product attention with one that uses locality-sensitive hashing, changing its complexity from $O(L^2)$ to $O(L \log L)$, where $L$ is the length of the sequence. Second, we use reversible residual layers instead of the standard residuals, which allows activations to be stored only once during training instead of $N$ times, where $N$ is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
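
The two techniques above can be illustrated concretely. Below is a minimal Python/NumPy sketch of the angular locality-sensitive hashing the abstract refers to: vectors are hashed via a random rotation, and attention is then restricted to positions that land in the same bucket, which is what reduces the cost from $O(L^2)$ to $O(L \log L)$. The function name, shapes, and bucket count are illustrative assumptions, not the authors' implementation.

import numpy as np

def lsh_buckets(x, n_buckets, rng):
    """Hash each row of x (shape (n, d)) into one of n_buckets buckets.

    Uses the argmax-over-a-random-rotation trick: project onto
    n_buckets // 2 random directions and take the argmax over the
    concatenation [xR; -xR], so angularly close vectors tend to
    collide in the same bucket.
    """
    n, d = x.shape
    r = rng.standard_normal((d, n_buckets // 2))  # random rotation R
    rotated = x @ r                               # (n, n_buckets // 2)
    return np.argmax(np.concatenate([rotated, -rotated], axis=-1), axis=-1)

rng = np.random.default_rng(0)
q = rng.standard_normal((1024, 64))   # 1024 query vectors of dimension 64
buckets = lsh_buckets(q, n_buckets=16, rng=rng)
# Full LSH attention would now sort positions by bucket id, chunk them,
# and attend only within (and to adjacent) chunks.

The reversible residual layers mentioned above follow a RevNet-style coupling: a layer's inputs can be recomputed exactly from its outputs, so intermediate activations need not be stored per layer. A minimal sketch, with placeholder sublayers f and g standing in for attention and feed-forward (again an illustrative assumption, not the paper's code):

import numpy as np

def rev_block_forward(x1, x2, f, g):
    # Forward pass of one reversible block on a split input (x1, x2).
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def rev_block_inverse(y1, y2, f, g):
    # Recover the inputs from the outputs alone; nothing is stored.
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

f = lambda z: np.tanh(z)          # stand-in for the attention sublayer
g = lambda z: np.maximum(z, 0.0)  # stand-in for the feed-forward sublayer

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal((2, 8, 16))
y1, y2 = rev_block_forward(x1, x2, f, g)
r1, r2 = rev_block_inverse(y1, y2, f, g)
assert np.allclose(x1, r1) and np.allclose(x2, r2)

During backpropagation, each block's inputs are recomputed on the fly via the inverse, which is why activations are stored only once regardless of the depth $N$.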

Cite

Text

Kitaev et al. "Reformer: The Efficient Transformer." International Conference on Learning Representations, 2020.

Markdown

[Kitaev et al. "Reformer: The Efficient Transformer." International Conference on Learning Representations, 2020.](https://mlanthology.org/iclr/2020/kitaev2020iclr-reformer/)

BibTeX

@inproceedings{kitaev2020iclr-reformer,
  title     = {{Reformer: The Efficient Transformer}},
  author    = {Kitaev, Nikita and Kaiser, Łukasz and Levskaya, Anselm},
  booktitle = {International Conference on Learning Representations},
  year      = {2020},
  url       = {https://mlanthology.org/iclr/2020/kitaev2020iclr-reformer/}
}