DSelect-K: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning

Abstract

The Mixture-of-Experts (MoE) architecture is showing promising results in improving parameter sharing in multi-task learning (MTL) and in scaling high-capacity neural networks. State-of-the-art MoE models use a trainable "sparse gate" to select a subset of the experts for each input example. While conceptually appealing, existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods. In this paper, we develop DSelect-k: a continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation. The gate can be trained using first-order methods, such as stochastic gradient descent, and offers explicit control over the number of experts to select. We demonstrate the effectiveness of DSelect-k on both synthetic and real MTL datasets with up to 128 tasks. Our experiments indicate that DSelect-k can achieve statistically significant improvements in prediction and expert selection over popular MoE gates. Notably, on a real-world, large-scale recommender system, DSelect-k achieves over 22% improvement in predictive performance compared to Top-k. We provide an open-source implementation of DSelect-k.
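
To make the abstract's ingredients concrete, below is a minimal NumPy sketch of a differentiable sparse gate built from a binary encoding, in the spirit of DSelect-k. It is not the authors' released implementation: the function names, the `gamma` parameter, and the particular cubic smooth-step used here are illustrative choices. Each of k single-expert selectors maps m real parameters through a smooth-step into "soft bits", decodes those bits into weights over n = 2^m experts, and the k selectors are mixed with softmax weights.

```python
import numpy as np

def smooth_step(t, gamma=1.0):
    """C^1 cubic smooth-step: 0 for t <= -gamma/2, 1 for t >= gamma/2,
    cubic interpolation in between, so gradients exist everywhere."""
    t = np.asarray(t, dtype=float)
    cubic = -2.0 / gamma**3 * t**3 + 3.0 / (2.0 * gamma) * t + 0.5
    return np.where(t <= -gamma / 2, 0.0, np.where(t >= gamma / 2, 1.0, cubic))

def single_expert_selector(z):
    """Binary-encoding selector: map z in [0, 1]^m to weights over n = 2^m experts,
    r_i = prod_j z_j^{b_j(i)} (1 - z_j)^{1 - b_j(i)}, where b_j(i) is bit j of i.
    When every z_j is exactly 0 or 1, r is one-hot, i.e. a single expert is selected."""
    m = len(z)
    r = np.ones(2 ** m)
    for i in range(2 ** m):
        for j in range(m):
            r[i] *= z[j] if (i >> j) & 1 else 1.0 - z[j]
    return r

def dselect_k_gate(alpha, w, gamma=1.0):
    """Combine k single-expert selectors with softmax weights.
    alpha: (k, m) real selector parameters, w: (k,) combination logits.
    Returns nonnegative expert weights over n = 2^m experts summing to 1,
    with at most k nonzeros once the smooth-step outputs saturate to 0/1."""
    k, m = alpha.shape
    weights = np.exp(w - w.max())
    weights /= weights.sum()
    q = np.zeros(2 ** m)
    for i in range(k):
        q += weights[i] * single_expert_selector(smooth_step(alpha[i], gamma))
    return q

# Toy usage: a gate over n = 8 experts that selects (at most) k = 2 of them.
rng = np.random.default_rng(0)
alpha = rng.normal(size=(2, 3))   # k = 2 selectors, m = log2(8) = 3 bits each
w = rng.normal(size=2)
q = dselect_k_gate(alpha, w)
print(q, q.sum())                 # expert weights, sum to 1
```

In the paper's setting the gate parameters are trained jointly with the experts by first-order methods (and can be conditioned on the input for per-example gating), with additional regularization that pushes the smooth-step outputs toward exactly 0 or 1 so that sparsity is reached at convergence; those training details are omitted from this sketch.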

Cite

Text

Hazimeh et al. "DSelect-K: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning." Neural Information Processing Systems, 2021.

Markdown

[Hazimeh et al. "DSelect-K: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/hazimeh2021neurips-dselectk/)

BibTeX

@inproceedings{hazimeh2021neurips-dselectk,
  title     = {{DSelect-K: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning}},
  author    = {Hazimeh, Hussein and Zhao, Zhe and Chowdhery, Aakanksha and Sathiamoorthy, Maheswaran and Chen, Yihua and Mazumder, Rahul and Hong, Lichan and Chi, Ed},
  booktitle = {Neural Information Processing Systems},
  year      = {2021},
  url       = {https://mlanthology.org/neurips/2021/hazimeh2021neurips-dselectk/}
}