Simple Hardware-Efficient Long Convolutions for Sequence Modeling

Abstract

State space models (SSMs) achieve high performance on long sequence modeling but require sophisticated initialization techniques and specialized implementations for high quality and runtime performance. We study whether a simple alternative can match SSMs in performance and efficiency: directly learning long convolutions over the sequence. We find that simply squashing the long convolutional kernel weights is enough to match SSMs in performance on a range of tasks, including the Long Range Arena (LRA) and language modeling. To improve runtime performance as well, we develop FlashButterfly, an IO-aware algorithm that computes long convolutions efficiently. FlashButterfly appeals to classic Butterfly decompositions of the convolution to reduce GPU memory IO and increase FLOP utilization. FlashButterfly speeds up the LRA benchmark by 7.0× over Transformers, and allows us to train on Path256, a challenging task with sequence length 64K, where we improve on the state of the art by 29.1 points while training 7.2× faster than prior work.
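
To make the core idea concrete, below is a minimal sketch of a directly-parameterized long convolution layer with a "squash" on the kernel weights, computed via FFT convolution. The squash is implemented as elementwise soft-thresholding and the class name `LongConv`, the threshold `lam`, the initialization scale, and the tensor layout are illustrative assumptions, not the paper's reference implementation (which uses FlashButterfly rather than a plain FFT call).

```python
import torch
import torch.nn as nn


class LongConv(nn.Module):
    """Sketch of a long convolution layer with squashed kernel weights.

    Assumptions for illustration: one kernel per channel, kernel length equal
    to the sequence length, and a soft-threshold squash with threshold `lam`.
    """

    def __init__(self, d_model: int, seq_len: int, lam: float = 0.003):
        super().__init__()
        self.lam = lam
        # Directly learned kernel, as long as the sequence itself.
        self.kernel = nn.Parameter(torch.randn(d_model, seq_len) * 0.002)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (batch, d_model, seq_len)
        L = u.shape[-1]
        # "Squash": soft-threshold the kernel weights toward zero.
        k = torch.sign(self.kernel) * torch.relu(self.kernel.abs() - self.lam)
        # FFT-based linear convolution in O(L log L); zero-pad to 2L to
        # avoid circular wrap-around, then keep the first L outputs.
        k_f = torch.fft.rfft(k, n=2 * L)
        u_f = torch.fft.rfft(u, n=2 * L)
        y = torch.fft.irfft(u_f * k_f, n=2 * L)[..., :L]
        return y


# Example usage:
# y = LongConv(d_model=64, seq_len=1024)(torch.randn(2, 64, 1024))  # -> (2, 64, 1024)
```

The FFT call here stands in for the paper's hardware-efficient kernel: FlashButterfly computes the same convolution through Butterfly decompositions to reduce GPU memory IO and raise FLOP utilization.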

Cite

Text

Fu et al. "Simple Hardware-Efficient Long Convolutions for Sequence Modeling." ICLR 2023 Workshops: ME-FoMo, 2023.

Markdown

[Fu et al. "Simple Hardware-Efficient Long Convolutions for Sequence Modeling." ICLR 2023 Workshops: ME-FoMo, 2023.](https://mlanthology.org/iclrw/2023/fu2023iclrw-simple/)

BibTeX

@inproceedings{fu2023iclrw-simple,
  title     = {{Simple Hardware-Efficient Long Convolutions for Sequence Modeling}},
  author    = {Fu, Daniel Y and Epstein, Elliot L and Nguyen, Eric and Thomas, Armin W and Zhang, Michael and Dao, Tri and Rudra, Atri and R{\'e}, Christopher},
  booktitle = {ICLR 2023 Workshops: ME-FoMo},
  year      = {2023},
  url       = {https://mlanthology.org/iclrw/2023/fu2023iclrw-simple/}
}