Multi-Scale Self-Attention for Text Classification

Abstract

In this paper, we introduce multi-scale structure as prior knowledge into self-attention modules. We propose a Multi-Scale Transformer that uses multi-scale multi-head self-attention to capture features at different scales. Based on linguistic considerations and an analysis of a Transformer (BERT) pre-trained on a large corpus, we further design a strategy to control the scale distribution for each layer. Results on three different kinds of tasks (21 datasets) show that our Multi-Scale Transformer consistently and significantly outperforms the standard Transformer on small and moderate-size datasets.
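
To make the multi-scale idea concrete, below is a minimal sketch (not the authors' implementation) of multi-head self-attention in which each head attends only within a local window of a different size, so different heads capture features at different scales. The class name `MultiScaleSelfAttention`, the `window_sizes` parameter, and the particular window sizes are illustrative assumptions; the paper's full method additionally controls the scale distribution per layer.

```python
# Hypothetical sketch of multi-scale multi-head self-attention:
# each head is restricted to a local window of a different size.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleSelfAttention(nn.Module):
    def __init__(self, d_model, window_sizes=(1, 3, 5, 7)):
        super().__init__()
        self.n_heads = len(window_sizes)
        assert d_model % self.n_heads == 0
        self.d_head = d_model // self.n_heads
        self.window_sizes = window_sizes
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        B, L, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Split into heads: (B, n_heads, L, d_head)
        def split(t):
            return t.view(B, L, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)

        # Scaled dot-product scores: (B, n_heads, L, L)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)

        # Band mask per head: position i may only attend to positions j
        # with |i - j| <= that head's window size.
        idx = torch.arange(L, device=x.device)
        dist = (idx[:, None] - idx[None, :]).abs()                    # (L, L)
        masks = torch.stack([dist <= w for w in self.window_sizes])   # (H, L, L)
        scores = scores.masked_fill(~masks.unsqueeze(0), float("-inf"))

        attn = F.softmax(scores, dim=-1)
        out = attn @ v                                                # (B, H, L, d_head)
        out = out.transpose(1, 2).reshape(B, L, self.n_heads * self.d_head)
        return self.out(out)

# Usage: a tiny smoke test on random data.
layer = MultiScaleSelfAttention(d_model=64, window_sizes=(1, 3, 5, 7))
y = layer(torch.randn(2, 10, 64))
print(y.shape)  # torch.Size([2, 10, 64])
```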

Cite

Text

Guo et al. "Multi-Scale Self-Attention for Text Classification." AAAI Conference on Artificial Intelligence, 2020. doi:10.1609/AAAI.V34I05.6290

Markdown

[Guo et al. "Multi-Scale Self-Attention for Text Classification." AAAI Conference on Artificial Intelligence, 2020.](https://mlanthology.org/aaai/2020/guo2020aaai-multi-a/) doi:10.1609/AAAI.V34I05.6290

BibTeX

@inproceedings{guo2020aaai-multi-a,
  title     = {{Multi-Scale Self-Attention for Text Classification}},
  author    = {Guo, Qipeng and Qiu, Xipeng and Liu, Pengfei and Xue, Xiangyang and Zhang, Zheng},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2020},
  pages     = {7847--7854},
  doi       = {10.1609/AAAI.V34I05.6290},
  url       = {https://mlanthology.org/aaai/2020/guo2020aaai-multi-a/}
}