Jump Self-Attention: Capturing High-Order Statistics in Transformers

Abstract

The recent success of the Transformer has benefited many real-world applications, thanks to its ability to build long-range dependencies through pairwise dot products. However, the strong assumption that elements attend directly to each other limits performance on tasks with high-order dependencies, such as natural language understanding and image captioning. To address this, we are the first to define Jump Self-Attention (JAT) for building Transformers. Inspired by the piece movements of English Draughts, we introduce a spectral convolutional technique to compute JAT on the dot-product feature map. This technique allows JAT to propagate within each self-attention head and is interchangeable with canonical self-attention. We further develop higher-order variants under the multi-hop assumption to increase generality. Moreover, the proposed architecture is compatible with pre-trained models. With extensive experiments, we empirically show that our methods significantly improve performance on ten different tasks.
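The abstract does not spell out the exact spectral formulation, but the multi-hop idea can be illustrated with a minimal sketch: compute the canonical dot-product attention map, then propagate it one extra hop (attention through an intermediate token) and mix the two maps before applying them to the values. The class name, the learnable gate, and the plain two-hop product A·A below are illustrative assumptions, not the authors' exact JAT construction.

import torch
import torch.nn as nn


class TwoHopSelfAttention(nn.Module):
    """Illustrative sketch: canonical one-hop attention mixed with a
    two-hop ("jump") propagation of the attention map.
    Not the paper's exact spectral-convolution formulation."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Learnable gate between one-hop and two-hop maps (assumption).
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, L, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t: torch.Tensor) -> torch.Tensor:
            # (B, L, d_model) -> (B, n_heads, L, d_head)
            return t.view(B, L, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = map(split, (q, k, v))

        # Canonical one-hop attention map per head.
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        # Two-hop map: attend to a token's attended tokens (the "jump").
        attn2 = attn @ attn
        mixed = self.alpha * attn + (1 - self.alpha) * attn2

        out = (mixed @ v).transpose(1, 2).reshape(B, L, -1)
        return self.out(out)


if __name__ == "__main__":
    layer = TwoHopSelfAttention(d_model=64, n_heads=8)
    x = torch.randn(2, 16, 64)
    print(layer(x).shape)  # torch.Size([2, 16, 64])

Because the mixed map has the same shape as the canonical attention map, a module like this is drop-in interchangeable with standard self-attention, which is the compatibility property the abstract highlights.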

Cite

Text

Zhou et al. "Jump Self-Attention: Capturing High-Order Statistics in Transformers." Neural Information Processing Systems, 2022.

Markdown

[Zhou et al. "Jump Self-Attention: Capturing High-Order Statistics in Transformers." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/zhou2022neurips-jump/)

BibTeX

@inproceedings{zhou2022neurips-jump,
  title     = {{Jump Self-Attention: Capturing High-Order Statistics in Transformers}},
  author    = {Zhou, Haoyi and Xiao, Siyang and Zhang, Shanghang and Peng, Jieqi and Zhang, Shuai and Li, Jianxin},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/zhou2022neurips-jump/}
}