MoEC: Mixture of Expert Clusters

Abstract

Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead. MoE models convert dense layers into sparse experts and use a gated routing network to activate experts conditionally. However, as the number of experts grows, MoE with an outrageous number of parameters suffers from overfitting and sparse data allocation. These problems are especially severe on tasks with limited data, hindering progress towards improving performance by scaling up. We verify that there exists a performance upper bound when scaling up sparse MoE. In this work, we propose Mixture of Expert Clusters — a general approach that enables expert layers to learn more diverse and appropriate knowledge by imposing variance-based constraints on the routing stage. Building on the cluster structure, we further propose a cluster-level expert dropout strategy specifically designed for expert clusters. Our experiments show that MoEC improves performance on machine translation and natural language understanding tasks. MoEC helps mitigate the overfitting and sparse data allocation problems, thus more fully releasing the potential of large-scale sparse models.
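To make the two ingredients in the abstract concrete, here is a minimal, hypothetical sketch of top-1 gated routing over experts grouped into equal-size clusters, with a cluster-level dropout step that masks out whole clusters before the softmax so tokens are re-routed to the surviving clusters. The function name, the top-1 choice, and the equal-size clusters are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def route_with_cluster_dropout(logits, num_clusters, drop_prob=0.0, rng=None):
    """Top-1 gated routing with optional cluster-level expert dropout (sketch).

    logits: (..., num_experts) router scores per token.
    Experts are assumed to be split into `num_clusters` equal-size clusters.
    During training, each cluster is dropped with probability `drop_prob`:
    its experts' logits are masked to -inf before the softmax, so tokens
    are re-routed among the surviving clusters.
    """
    num_experts = logits.shape[-1]
    assert num_experts % num_clusters == 0
    cluster_size = num_experts // num_clusters

    masked = logits.copy()
    if drop_prob > 0.0 and rng is not None:
        drop = rng.random(num_clusters) < drop_prob
        if drop.all():  # keep at least one cluster alive
            drop[rng.integers(num_clusters)] = False
        expert_mask = np.repeat(drop, cluster_size)  # per-expert drop flags
        masked[..., expert_mask] = -np.inf

    probs = softmax(masked)
    chosen = probs.argmax(-1)  # top-1 expert index per token
    return chosen, probs
```

With `drop_prob=0` this reduces to plain top-1 gating; with dropout enabled, the surviving clusters absorb the dropped clusters' tokens, which is the re-routing behavior a cluster-level dropout scheme relies on.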

Cite

Text

Xie et al. "MoEC: Mixture of Expert Clusters." AAAI Conference on Artificial Intelligence, 2023. doi:10.1609/AAAI.V37I11.26617

Markdown

[Xie et al. "MoEC: Mixture of Expert Clusters." AAAI Conference on Artificial Intelligence, 2023.](https://mlanthology.org/aaai/2023/xie2023aaai-moec/) doi:10.1609/AAAI.V37I11.26617

BibTeX

@inproceedings{xie2023aaai-moec,
  title     = {{MoEC: Mixture of Expert Clusters}},
  author    = {Xie, Yuan and Huang, Shaohan and Chen, Tianyu and Wei, Furu},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2023},
  pages     = {13807--13815},
  doi       = {10.1609/AAAI.V37I11.26617},
  url       = {https://mlanthology.org/aaai/2023/xie2023aaai-moec/}
}