Composite Slice Transformer: An Efficient Transformer with Composition of Multi-Scale Multi-Range Attentions

Abstract

Since the introduction of Transformers, researchers have tackled their notoriously expensive quadratic complexity. While significant gains in computational efficiency have been achieved, they often come at the cost of reduced accuracy. In this paper, we propose Composite Slice Transformer (CST), a Transformer-based network equipped with a composition of multi-scale, multi-range attentions that improves both efficiency and modeling capability. After stacking fixed-length slices of the input sequence, each layer in CST performs a pair of fine- and coarse-grained attentions with short and long ranges in a sequential manner, coupled with a volatile instant positional embedding, enabling efficient token interactions {\em and} improving the expressiveness of the model. In addition to a significantly reduced $O(NL+N^2/L^2)$ complexity for sequence length $N$ and slice length $L$, CST achieves superior performance on a variety of tasks. We show that CST surpasses recently published efficient Transformers on the Long Range Arena benchmark, demonstrating its capability to model bidirectional long-range dependencies. It outperforms the standard Transformer by a margin of $6.9$\% in average accuracy across the five classification tasks of the benchmark, while its complexity remains comparable to that of other efficient Transformers. Furthermore, on word-level autoregressive language modeling with the WikiText-103 dataset, CST performs competitively against the standard Transformer, with only a $2$\% gap in test perplexity, while outperforming other efficient Transformers.
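The complexity claim follows from the two scales of attention described above: fine-grained attention restricted to fixed-length slices costs $O(NL)$, and coarse-grained attention over the $N/L$ slice summaries costs $O(N^2/L^2)$. The sketch below, written in PyTorch, is a minimal illustration of that pattern only; it is not the authors' implementation, and all names in it (`CompositeSliceAttentionSketch`, `slice_len`, the mean-pooled slice summaries, the residual combination) are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch (assumed, not the authors' code) of two-scale sliced attention:
# fine-grained attention within each fixed-length slice (O(N*L)) followed by
# coarse-grained attention across per-slice summaries (O(N^2/L^2)).
import torch
import torch.nn as nn


class CompositeSliceAttentionSketch(nn.Module):
    def __init__(self, dim: int, num_heads: int, slice_len: int):
        super().__init__()
        self.slice_len = slice_len
        # Fine-grained attention applied independently within each slice.
        self.fine_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Coarse-grained attention applied across slice summaries.
        self.coarse_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); seq_len is assumed divisible by slice_len.
        b, n, d = x.shape
        L = self.slice_len
        slices = x.reshape(b * (n // L), L, d)

        # Fine scale: tokens attend only within their own slice -> O(N * L).
        fine, _ = self.fine_attn(slices, slices, slices)
        fine = fine.reshape(b, n, d)

        # Coarse scale: mean-pool each slice into one summary token, then let
        # the N/L summaries attend to each other -> O(N^2 / L^2).
        summaries = fine.reshape(b, n // L, L, d).mean(dim=2)
        coarse, _ = self.coarse_attn(summaries, summaries, summaries)

        # Broadcast the coarse output back to token resolution and combine
        # (a simple residual sum; the paper's composition may differ).
        coarse_tokens = coarse.repeat_interleave(L, dim=1)
        return fine + coarse_tokens


if __name__ == "__main__":
    layer = CompositeSliceAttentionSketch(dim=64, num_heads=4, slice_len=16)
    out = layer(torch.randn(2, 128, 64))  # batch=2, N=128, dim=64
    print(out.shape)  # torch.Size([2, 128, 64])
```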

Cite

Text

Lee et al. "Composite Slice Transformer: An Efficient Transformer with Composition of Multi-Scale Multi-Range Attentions." International Conference on Learning Representations, 2023.

Markdown

[Lee et al. "Composite Slice Transformer: An Efficient Transformer with Composition of Multi-Scale Multi-Range Attentions." International Conference on Learning Representations, 2023.](https://mlanthology.org/iclr/2023/lee2023iclr-composite/)

BibTeX

@inproceedings{lee2023iclr-composite,
  title     = {{Composite Slice Transformer: An Efficient Transformer with Composition of Multi-Scale Multi-Range Attentions}},
  author    = {Lee, Mingu and Pitre, Saurabh and Jiang, Tianyu and Letourneau, Pierre-David and Morse, Matthew J and Jang, Kanghwan and Soriaga, Joseph and Noorzad, Parham and Cheng, Hsin-Pai and Lott, Christopher},
  booktitle = {International Conference on Learning Representations},
  year      = {2023},
  url       = {https://mlanthology.org/iclr/2023/lee2023iclr-composite/}
}