Attention-Only Transformers and Implementing MLPs with Attention Heads

Abstract

The transformer architecture is widely used in machine learning models and consists of two alternating sublayers: attention heads and MLPs. We prove that an MLP neuron can be implemented by a masked attention head with internal dimension 1, so long as the MLP's activation function comes from a restricted class that includes SiLU and close approximations of ReLU and GeLU. This allows one to convert an MLP-and-attention transformer into an attention-only transformer at the cost of greatly increasing the number of attention heads.
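
The core identity behind this result can be checked numerically. The sketch below is not the paper's exact construction (which operates inside a full transformer with learned query, key, and value projections); it only illustrates the simplified case of a scalar pre-activation x: a softmax over the two attention logits [x, 0], applied to the values [x, 0], assigns weight sigmoid(x) to the first position and therefore outputs x·σ(x) = SiLU(x). In other words, a single masked attention head with internal dimension 1 can reproduce a SiLU neuron. The function names are hypothetical.

```python
import numpy as np

def silu(x):
    """SiLU / swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def attention_head_neuron(x):
    """One masked attention head with internal (head) dimension 1.

    The head attends over two positions: the current token, whose attention
    logit and value are both the scalar pre-activation x, and a dummy position
    whose logit and value are both 0. Softmax over [x, 0] gives the current
    token weight sigmoid(x), so the head outputs x * sigmoid(x) = SiLU(x).
    """
    logits = np.array([x, 0.0])
    weights = np.exp(logits) / np.exp(logits).sum()   # softmax over the two positions
    values = np.array([x, 0.0])                       # value projection is 1-dimensional
    return weights @ values

xs = np.linspace(-6.0, 6.0, 25)
assert np.allclose([attention_head_neuron(x) for x in xs], silu(xs))
```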

Cite

Text

Huben and Morris. "Attention-Only Transformers and Implementing MLPs with Attention Heads." NeurIPS 2023 Workshops: M3L, 2023.

Markdown

[Huben and Morris. "Attention-Only Transformers and Implementing MLPs with Attention Heads." NeurIPS 2023 Workshops: M3L, 2023.](https://mlanthology.org/neuripsw/2023/huben2023neuripsw-attentiononly/)

BibTeX

@inproceedings{huben2023neuripsw-attentiononly,
  title     = {{Attention-Only Transformers and Implementing MLPs with Attention Heads}},
  author    = {Huben, Robert and Morris, Valerie},
  booktitle = {NeurIPS 2023 Workshops: M3L},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/huben2023neuripsw-attentiononly/}
}