MoH: Multi-Head Attention as Mixture-of-Head Attention

ICML 2025 pp. 28233-28255

Abstract

In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to reduce computational costs while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention while using only 50%–90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% while utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
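
As a rough aid to the abstract (the notation below is ours and may differ from the paper's): concatenating the head outputs and applying the output projection is equivalent to summing per-head projections, and MoH, as described above, replaces that plain sum with a routed, weighted sum in which heads a token does not select receive a zero score.

\mathrm{MHA}(X) = \mathrm{Concat}(H_1, \ldots, H_h)\, W^O = \sum_{i=1}^{h} H_i W^O_i,
\qquad
\mathrm{MoH}(X) = \sum_{i=1}^{h} g_i\, H_i W^O_i,

where H_i is the output of the i-th attention head, W^O_i is the corresponding block of the output projection, and g_i is a routing weight that is zero for heads not selected for the token.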

Cite

Text

Jin et al. "MoH: Multi-Head Attention as Mixture-of-Head Attention." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Jin et al. "MoH: Multi-Head Attention as Mixture-of-Head Attention." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/jin2025icml-moh/)

BibTeX

@inproceedings{jin2025icml-moh,
  title     = {{MoH: Multi-Head Attention as Mixture-of-Head Attention}},
  author    = {Jin, Peng and Zhu, Bo and Yuan, Li and Yan, Shuicheng},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {28233--28255},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/jin2025icml-moh/}
}