BAM! Just like That: Simple and Efficient Parameter Upcycling for Mixture of Experts

Abstract

Training Mixture of Experts (MoEs) from scratch in a large-scale regime is expensive. Previous work addresses this challenge by independently training multiple dense expert models and using them to initialize an MoE: the MoE layers are initialized with the experts' feed-forward parameters, while all other parameters are merged. This limits the advantages of the specialized dense models when "upcycling" them into MoEs. We propose BAM (Branch-Attend-Mix), a simple yet effective improvement to MoE training. BAM makes full use of the specialized dense models by not only using their feed-forward networks (FFNs) to initialize the MoE layers, but also fully leveraging the experts' attention weights by initializing the attention layers as Mixture of Attention (MoA) layers. Our experiments with seed models ranging from 590 million to 2 billion parameters show that our approach outperforms state-of-the-art approaches under the same data and compute budget, in both perplexity and downstream task evaluations.
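For intuition, the sketch below contrasts FFN-only upcycling with BAM-style upcycling on toy Transformer blocks. This is a minimal conceptual sketch, not the authors' implementation: the `DenseBlock`, `upcycle_ffn_only`, and `upcycle_bam` names, module shapes, and merging-by-averaging are assumptions, and routing/gating is omitted entirely.

```python
# Conceptual sketch of parameter upcycling (hypothetical code, not the paper's).
import copy
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """A toy dense Transformer block standing in for one specialized seed model."""
    def __init__(self, d_model=64, d_ff=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm = nn.LayerNorm(d_model)

def upcycle_ffn_only(dense_blocks):
    """Baseline upcycling: each FFN becomes an MoE expert; attention weights are merged (averaged)."""
    ffn_experts = nn.ModuleList(copy.deepcopy(b.ffn) for b in dense_blocks)
    shared_attn = copy.deepcopy(dense_blocks[0].attn)
    with torch.no_grad():
        for name, p in shared_attn.named_parameters():
            p.copy_(torch.stack(
                [dict(b.attn.named_parameters())[name] for b in dense_blocks]).mean(0))
    return ffn_experts, shared_attn

def upcycle_bam(dense_blocks):
    """BAM-style upcycling: both FFN and attention weights seed per-expert modules (MoE + MoA)."""
    ffn_experts = nn.ModuleList(copy.deepcopy(b.ffn) for b in dense_blocks)
    attn_experts = nn.ModuleList(copy.deepcopy(b.attn) for b in dense_blocks)
    return ffn_experts, attn_experts

if __name__ == "__main__":
    seeds = [DenseBlock() for _ in range(4)]        # four specialized dense seed models
    ffn_experts, attn_experts = upcycle_bam(seeds)
    print(len(ffn_experts), len(attn_experts))      # 4 experts in both the MoE and MoA layers
```

The key point the sketch illustrates is that BAM keeps per-expert attention modules instead of collapsing them into a single merged attention layer, so specialization in the attention weights is preserved for the router to exploit.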

Cite

Text

Zhang et al. "BAM! Just like That: Simple and Efficient Parameter Upcycling for Mixture of Experts." ICML 2024 Workshops: ES-FoMo-II, 2024.

Markdown

[Zhang et al. "BAM! Just like That: Simple and Efficient Parameter Upcycling for Mixture of Experts." ICML 2024 Workshops: ES-FoMo-II, 2024.](https://mlanthology.org/icmlw/2024/zhang2024icmlw-bam/)

BibTeX

@inproceedings{zhang2024icmlw-bam,
  title     = {{BAM! Just like That: Simple and Efficient Parameter Upcycling for Mixture of Experts}},
  author    = {Zhang, Qizhen and Gritsch, Nikolas and Gnaneshwar, Dwaraknath and Guo, Simon and Cairuz, David and Venkitesh, Bharat and Foerster, Jakob Nicolaus and Blunsom, Phil and Ruder, Sebastian and Üstün, Ahmet and Locatelli, Acyr},
  booktitle = {ICML 2024 Workshops: ES-FoMo-II},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/zhang2024icmlw-bam/}
}