Dense Backpropagation Improves Routing for Sparsely-Gated Mixture-of-Experts

Abstract

Sparsely-gated Mixture-of-Experts (MoEs) have proven to be more efficient than dense Transformers because they dynamically activate a subset of their overall parameters by routing tokens to selected experts. This allows practitioners to scale up model parameter counts without significantly increasing total compute. However, current MoE training approaches only update the router with a sparse gradient and suffer from issues such as load imbalance. We propose a new router that can receive a dense gradient update from a sparse forward pass. Our method adds minimal overhead, but improves on standard Top-K routing in both performance and load balance.
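The sketch below illustrates the baseline setup the abstract refers to: standard Top-K routing, where only the K selected gate values per token ever multiply an expert output, so the router is trained from a sparse signal. This is not the paper's proposed router; it is a minimal illustration under assumed names and dimensions (`num_experts`, `top_k`, toy linear experts) of the routing scheme the method improves upon.

```python
# Minimal sketch of standard Top-K MoE routing (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_tokens, d_model, num_experts, top_k = 4, 8, 4, 2

x = torch.randn(num_tokens, d_model)                     # token representations
router = torch.nn.Linear(d_model, num_experts)           # router / gating network
experts = [torch.nn.Linear(d_model, d_model) for _ in range(num_experts)]

logits = router(x)                                        # (tokens, experts)
probs = F.softmax(logits, dim=-1)

# Top-K routing: each token keeps only its K largest gate values.
topk_vals, topk_idx = probs.topk(top_k, dim=-1)

out = torch.zeros_like(x)
for e, expert in enumerate(experts):
    token_mask = (topk_idx == e).any(dim=-1)              # tokens routed to expert e
    if token_mask.any():
        gate = probs[token_mask, e].unsqueeze(-1)         # selected gate values only
        rows = token_mask.nonzero(as_tuple=True)[0]
        out = out.index_add(0, rows, gate * expert(x[token_mask]))

out.sum().backward()
# Only the selected gate values enter the output, so the router never observes how
# the unselected experts would have transformed each token. This is the sparse
# training signal that the paper's dense router update is designed to address.
print(router.weight.grad.norm())
```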

Cite

Text

Panda et al. "Dense Backpropagation Improves Routing for Sparsely-Gated Mixture-of-Experts." NeurIPS 2024 Workshops: OPT, 2024.

Markdown

[Panda et al. "Dense Backpropagation Improves Routing for Sparsely-Gated Mixture-of-Experts." NeurIPS 2024 Workshops: OPT, 2024.](https://mlanthology.org/neuripsw/2024/panda2024neuripsw-dense-a/)

BibTeX

@inproceedings{panda2024neuripsw-dense-a,
  title     = {{Dense Backpropagation Improves Routing for Sparsely-Gated Mixture-of-Experts}},
  author    = {Panda, Ashwinee and Baherwani, Vatsal and Sarwar, Zain and Thérien, Benjamin and Rawls, Stephen and Sahu, Sambit and Chakraborty, Supriyo and Goldstein, Tom},
  booktitle = {NeurIPS 2024 Workshops: OPT},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/panda2024neuripsw-dense-a/}
}