SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Abstract
Despite many recent works on Mixture-of-Experts (MoE) for resource-efficient Transformer language models, existing methods mostly focus on MoE for feedforward layers. Previous attempts at extending MoE to the self-attention layer fail to match the performance of the parameter-matched baseline. Our novel SwitchHead is an effective MoE method for the attention layer that reduces both the compute and memory requirements and achieves a wall-clock speedup, while matching the language modeling performance of the baseline Transformer. Its MoE mechanism allows SwitchHead to compute up to 8 times fewer attention matrices than the standard Transformer. SwitchHead can also be combined with MoE feedforward layers, resulting in fully-MoE "SwitchAll" Transformers. For our 262M-parameter model trained on C4, SwitchHead matches the perplexity of the standard model with only 44% of the compute and 27% of the memory usage. Zero-shot experiments on downstream tasks confirm the performance of SwitchHead, e.g., achieving more than a 3.5% absolute improvement on BLiMP compared to the baseline trained with an equal compute budget.
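To make the idea of MoE attention concrete, below is a minimal, hedged sketch (not the authors' implementation). It assumes a plausible setup in the spirit of the abstract: each attention head owns a small bank of value/output projection experts, a sigmoid router picks the top-k experts per token, and queries/keys are shared per head so only n_heads attention matrices are computed regardless of the expert count. All class and parameter names (MoEAttentionSketch, n_experts, top_k) are illustrative assumptions.

# Minimal MoE-attention sketch (assumption-based, PyTorch). Causal masking is
# omitted for brevity; the dense einsum stands in for a sparse expert dispatch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEAttentionSketch(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_head: int,
                 n_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.n_experts, self.top_k = n_experts, top_k
        self.q = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.k = nn.Linear(d_model, n_heads * d_head, bias=False)
        # One value and one output expert bank per head.
        self.v = nn.Parameter(torch.randn(n_heads, n_experts, d_model, d_head) * 0.02)
        self.o = nn.Parameter(torch.randn(n_heads, n_experts, d_head, d_model) * 0.02)
        self.router = nn.Linear(d_model, n_heads * n_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.q(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        # Non-competitive (sigmoid) routing; keep only the top-k experts per token and head.
        gates = torch.sigmoid(self.router(x)).view(B, T, self.n_heads, self.n_experts)
        topv, topi = gates.topk(self.top_k, dim=-1)
        mask = torch.zeros_like(gates).scatter_(-1, topi, topv)  # zero out unused experts

        # Values are a gated mixture of per-head experts; attention itself is computed
        # once per head, independent of the number of experts.
        v = torch.einsum("btd,hedf,bthe->bhtf", x, self.v, mask)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        ctx = attn @ v  # (B, n_heads, T, d_head)
        out = torch.einsum("bhtf,hefd,bthe->btd", ctx, self.o, mask)
        return out

# Usage: a single forward pass on toy data.
layer = MoEAttentionSketch(d_model=64, n_heads=2, d_head=32)
print(layer(torch.randn(1, 10, 64)).shape)  # torch.Size([1, 10, 64])

The point of the sketch is the compute accounting: the number of (T x T) attention matrices scales with n_heads, not with n_heads * n_experts, which is how an MoE attention layer can use far fewer attention matrices than a parameter-matched dense multi-head baseline.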
Cite
Text
Csordás et al. "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention." Neural Information Processing Systems, 2024. doi:10.52202/079017-2368
Markdown
[Csordás et al. "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/csordas2024neurips-switchhead/) doi:10.52202/079017-2368
BibTeX
@inproceedings{csordas2024neurips-switchhead,
title = {{SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention}},
author = {Csordás, Róbert and Piękos, Piotr and Irie, Kazuki and Schmidhuber, Jürgen},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-2368},
url = {https://mlanthology.org/neurips/2024/csordas2024neurips-switchhead/}
}