Are Sixteen Heads Really Better than One?

Abstract

Multi-headed attention is a driving force behind recent state-of-the-art NLP models. By applying multiple attention mechanisms in parallel, it can express sophisticated functions beyond the simple weighted average. However, we observe that, in practice, a large proportion of attention heads can be removed at test time without significantly impacting performance, and that some layers can even be reduced to a single head. Further analysis of machine translation models reveals that the self-attention layers can be significantly pruned, while the encoder-decoder attention layers are more dependent on multi-headedness.
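The test-time ablation described above amounts to zeroing out the output of individual heads inside a multi-head attention layer. The following is a minimal sketch of that idea, not the authors' implementation; the module name, the head_mask argument, and the dimensions are illustrative assumptions.

# Minimal sketch of multi-head self-attention with a per-head mask that
# zeroes out pruned heads at test time. Names and shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, head_mask=None):
        # x: (batch, seq_len, d_model); head_mask: (n_heads,) of 0/1 values.
        batch, seq_len, _ = x.shape
        def split(t):
            return t.view(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        attn = F.softmax(scores, dim=-1)
        context = attn @ v  # (batch, n_heads, seq_len, d_head)
        if head_mask is not None:
            # Zero out the contribution of pruned heads before the output projection.
            context = context * head_mask.view(1, -1, 1, 1)
        context = context.transpose(1, 2).reshape(batch, seq_len, -1)
        return self.out_proj(context)

# Example: reduce a 16-head layer to a single head at test time.
layer = SimpleMultiHeadAttention(d_model=512, n_heads=16)
x = torch.randn(2, 10, 512)
mask = torch.zeros(16)
mask[0] = 1.0  # keep only the first head
with torch.no_grad():
    out = layer(x, head_mask=mask)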

Cite

Text

Michel et al. "Are Sixteen Heads Really Better than One?" Neural Information Processing Systems, 2019.

Markdown

[Michel et al. "Are Sixteen Heads Really Better than One?" Neural Information Processing Systems, 2019.](https://mlanthology.org/neurips/2019/michel2019neurips-sixteen/)

BibTeX

@inproceedings{michel2019neurips-sixteen,
  title     = {{Are Sixteen Heads Really Better than One?}},
  author    = {Michel, Paul and Levy, Omer and Neubig, Graham},
  booktitle = {Neural Information Processing Systems},
  year      = {2019},
  pages     = {14014--14024},
  url       = {https://mlanthology.org/neurips/2019/michel2019neurips-sixteen/}
}