Jump Self-Attention: Capturing High-Order Statistics in Transformers
Abstract
The recent success of Transformers has benefited many real-world applications, thanks to their capability of building long-range dependencies through pairwise dot-products. However, the strong assumption that elements attend directly to one another limits performance on tasks with high-order dependencies, such as natural language understanding and image captioning. To address this, we are the first to define Jump Self-Attention (JAT) for building Transformers. Inspired by piece movements in English Draughts, we introduce a spectral convolutional technique to compute JAT on the dot-product feature map. This technique allows JAT to propagate within each self-attention head and is interchangeable with canonical self-attention. We further develop higher-order variants under the multi-hop assumption to increase generality. Moreover, the proposed architecture is compatible with pre-trained models. With extensive experiments, we empirically show that our method significantly improves performance on ten different tasks.
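The paper computes JAT with a spectral convolution over the dot-product feature map; as a rough intuition only, the sketch below shows a plain multi-hop reading of the idea, where the row-stochastic attention map is raised to a matrix power so tokens can attend through intermediate "jump" tokens. The function name `multi_hop_self_attention` and the `hops` parameter are illustrative assumptions, not names from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_hop_self_attention(X, Wq, Wk, Wv, hops=2):
    """Single-head self-attention whose attention map is raised to a matrix
    power, so each token aggregates values reachable through intermediate
    'jumps'. hops=1 recovers canonical scaled dot-product self-attention.

    X:          (n, d_model) token representations
    Wq, Wk, Wv: (d_model, d_k) projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (n, n) one-hop attention map
    A_hop = np.linalg.matrix_power(A, hops)      # (n, n) multi-hop attention map
    return A_hop @ V                             # (n, d_k) aggregated values

# Toy usage with random projections.
rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = multi_hop_self_attention(X, Wq, Wk, Wv, hops=2)
print(out.shape)  # (5, 8)
```

Because the one-hop map is row-stochastic, its matrix powers remain row-stochastic, so the multi-hop variant stays a valid drop-in replacement for the canonical attention weights in each head.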
Cite
Text
Zhou et al. "Jump Self-Attention: Capturing High-Order Statistics in Transformers." Neural Information Processing Systems, 2022.
Markdown
[Zhou et al. "Jump Self-Attention: Capturing High-Order Statistics in Transformers." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/zhou2022neurips-jump/)
BibTeX
@inproceedings{zhou2022neurips-jump,
title = {{Jump Self-Attention: Capturing High-Order Statistics in Transformers}},
author = {Zhou, Haoyi and Xiao, Siyang and Zhang, Shanghang and Peng, Jieqi and Zhang, Shuai and Li, Jianxin},
booktitle = {Neural Information Processing Systems},
year = {2022},
url = {https://mlanthology.org/neurips/2022/zhou2022neurips-jump/}
}