Block-State Transformers

Abstract

State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies, and they scale efficiently to long sequences owing to their subquadratic runtime complexity. Originally designed for continuous signals, SSMs have shown superior performance on a plethora of tasks in vision and audio; however, they still lag behind Transformers on language modeling tasks. In this work, we propose a hybrid layer named Block-State Transformer (BST) that internally combines an SSM sublayer for long-range contextualization with a Block Transformer sublayer for short-term representation of sequences. We study three different, and completely parallelizable, variants that integrate SSMs and block-wise attention. We show that our model outperforms similar Transformer-based architectures on language modeling perplexity and generalizes to longer sequences. In addition, the Block-State Transformer demonstrates a more than tenfold increase in speed at the layer level compared to the Block-Recurrent Transformer when model parallelization is employed.
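
To make the layer structure concrete, below is a minimal, illustrative sketch of a BST-style layer in PyTorch. It is not the authors' implementation: the SSM sublayer is simplified to a sequentially scanned diagonal recurrence (the paper uses a parallel, FFT-based SSM), the context handed to each block is a single summary state rather than the multiple context states of the paper's variants, and all module and parameter names (`SimpleDiagonalSSM`, `BlockStateLayer`, `block_len`, etc.) are assumptions introduced for this sketch.

```python
# Hypothetical sketch of a Block-State Transformer (BST) layer.
# Assumptions: a toy diagonal SSM scanned sequentially (not the paper's
# parallel FFT-based SSM) and standard multi-head attention for the
# Block Transformer sublayer. Names are illustrative, not the authors'.
import torch
import torch.nn as nn


class SimpleDiagonalSSM(nn.Module):
    """Toy diagonal state space sublayer: h_t = A*h_{t-1} + B*x_t, y_t = C*h_t."""

    def __init__(self, d_model: int, d_state: int = 64):
        super().__init__()
        self.A = nn.Parameter(torch.rand(d_state) * 0.5 + 0.4)  # decay in (0.4, 0.9)
        self.B = nn.Linear(d_model, d_state, bias=False)
        self.C = nn.Linear(d_state, d_model, bias=False)

    def forward(self, x):                       # x: (batch, seq, d_model)
        u = self.B(x)                           # project input into state space
        h = torch.zeros(x.size(0), u.size(-1), device=x.device)
        outs = []
        for t in range(x.size(1)):              # sequential scan (illustrative only)
            h = self.A * h + u[:, t]
            outs.append(self.C(h))
        return torch.stack(outs, dim=1)         # (batch, seq, d_model)


class BlockStateLayer(nn.Module):
    """One BST-style layer: an SSM provides long-range context, and block-wise
    attention over short blocks cross-attends to the SSM state summarizing the
    preceding context (a rough analogue of the hybrid layer in the abstract)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, block_len: int = 128):
        super().__init__()
        self.block_len = block_len
        self.ssm = SimpleDiagonalSSM(d_model)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):                       # x: (batch, seq, d_model), seq % block_len == 0
        context = self.ssm(x)                   # long-range context for every position
        b, s, d = x.shape
        n_blocks = s // self.block_len
        xb = x.view(b * n_blocks, self.block_len, d)
        # SSM state at the end of the *previous* block summarizes the past.
        ctx = context.view(b, n_blocks, self.block_len, d)[:, :, -1:, :]
        ctx = torch.roll(ctx, shifts=1, dims=1)            # block i sees block i-1
        ctx[:, 0] = 0.0                                    # first block has no past context
        ctx = ctx.reshape(b * n_blocks, 1, d)
        # Short-range causal self-attention within each block.
        mask = nn.Transformer.generate_square_subsequent_mask(self.block_len).to(x.device)
        h, _ = self.self_attn(xb, xb, xb, attn_mask=mask, need_weights=False)
        # Inject long-range SSM context via cross-attention.
        c, _ = self.cross_attn(xb, ctx, ctx, need_weights=False)
        xb = self.norm1(xb + h + c)
        xb = self.norm2(xb + self.ffn(xb))
        return xb.view(b, s, d)


if __name__ == "__main__":
    layer = BlockStateLayer()
    tokens = torch.randn(2, 512, 256)           # (batch, seq, d_model)
    print(layer(tokens).shape)                  # torch.Size([2, 512, 256])
```

Because attention is restricted to fixed-size blocks and the SSM scan can be parallelized in practice, a layer of this form avoids quadratic cost in the full sequence length, which is the property the abstract highlights.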

Cite

Text

Pilault et al. "Block-State Transformers." Neural Information Processing Systems, 2023.

Markdown

[Pilault et al. "Block-State Transformers." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/pilault2023neurips-blockstate/)

BibTeX

@inproceedings{pilault2023neurips-blockstate,
  title     = {{Block-State Transformers}},
  author    = {Pilault, Jonathan and Fathi, Mahan and Firat, Orhan and Pal, Chris and Bacon, Pierre-Luc and Goroshin, Ross},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/pilault2023neurips-blockstate/}
}