On the Representation Collapse of Sparse Mixture of Experts

Abstract

Sparse mixture of experts provides larger model capacity while requiring a constant computational overhead. It employs a routing mechanism that distributes input tokens to the best-matched experts according to their hidden representations. However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse. In this work, we propose to estimate the routing scores between tokens and experts on a low-dimensional hypersphere. We conduct extensive experiments on cross-lingual language model pre-training and fine-tuning on downstream tasks. Experimental results across seven multilingual benchmarks show that our method achieves consistent gains. We also present a comprehensive analysis of the representation and routing behaviors of our models. Our method alleviates the representation collapse issue and achieves more consistent routing than the baseline mixture-of-experts methods.
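
The routing the abstract describes can be sketched as follows: token hidden states are projected into a low-dimensional space and L2-normalized, expert embeddings are normalized onto the same hypersphere, and the routing score becomes a temperature-scaled cosine similarity. The snippet below is a minimal PyTorch illustration under these assumptions; the class name `HypersphericalRouter`, the routing dimension of 8, and the learnable-temperature parameterization are illustrative choices, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HypersphericalRouter(nn.Module):
    """Routing scores computed on a low-dimensional hypersphere (sketch).

    Tokens are projected into a small routing space and L2-normalized;
    expert embeddings live on the same unit hypersphere, so the routing
    score reduces to a temperature-scaled cosine similarity.
    """

    def __init__(self, hidden_dim: int, num_experts: int,
                 routing_dim: int = 8, init_temperature: float = 0.07):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, routing_dim, bias=False)
        self.expert_embeddings = nn.Parameter(torch.randn(num_experts, routing_dim))
        # Learnable temperature; smaller tau gives a sharper routing distribution.
        self.log_tau = nn.Parameter(torch.log(torch.tensor(init_temperature)))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_tokens, hidden_dim)
        tokens = F.normalize(self.proj(hidden_states), dim=-1)   # project, then map onto the hypersphere
        experts = F.normalize(self.expert_embeddings, dim=-1)
        scores = tokens @ experts.t() / self.log_tau.exp()       # cosine similarity / tau
        return F.softmax(scores, dim=-1)                         # routing probabilities per expert

router = HypersphericalRouter(hidden_dim=768, num_experts=32)
hidden = torch.randn(10, 768)        # 10 token representations
probs = router(hidden)               # (10, 32) routing probabilities
expert_ids = probs.argmax(dim=-1)    # top-1 expert assignment per token
```

Because the scores depend only on directions in a low-dimensional routing space rather than on proximity to expert centroids in the full hidden space, token representations are under less pressure to cluster around those centroids, which is the collapse trend the abstract refers to.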

Cite

Text

Chi et al. "On the Representation Collapse of Sparse Mixture of Experts." Neural Information Processing Systems, 2022.

Markdown

[Chi et al. "On the Representation Collapse of Sparse Mixture of Experts." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/chi2022neurips-representation/)

BibTeX

@inproceedings{chi2022neurips-representation,
  title     = {{On the Representation Collapse of Sparse Mixture of Experts}},
  author    = {Chi, Zewen and Dong, Li and Huang, Shaohan and Dai, Damai and Ma, Shuming and Patra, Barun and Singhal, Saksham and Bajaj, Payal and Song, Xia and Mao, Xian-Ling and Huang, Heyan and Wei, Furu},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/chi2022neurips-representation/}
}