On the Spatial Structure of Mixture-of-Experts in Transformers

Abstract

A common assumption is that Mixture-of-Experts (MoE) routers rely primarily on semantic features for expert selection. Our study challenges this notion by demonstrating that positional information about tokens also plays a crucial role in routing decisions. Through extensive empirical analysis, we provide evidence supporting this hypothesis, develop a phenomenological explanation of the observed behavior, and discuss practical implications for MoE-based architectures.
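To make the routing setup concrete, below is a minimal sketch (not the authors' code) of a standard linear softmax router applied to hidden states that mix semantic content with learned positional embeddings, together with a simple probe of positional structure: how often tokens at the same position agree on their top expert across a batch. All names here (`pos_emb`, `semantic`, the agreement statistic) are illustrative assumptions, not artifacts from the paper.

import torch
import torch.nn.functional as F

torch.manual_seed(0)

d_model, n_experts = 64, 8
batch, seq_len = 16, 128

# Standard linear router: logits = x W_r, expert choice = argmax of softmax.
router = torch.nn.Linear(d_model, n_experts, bias=False)

# Hypothetical inputs: hidden states mixing a "semantic" part with learned
# absolute positional embeddings, as in many Transformer blocks.
pos_emb = torch.nn.Embedding(seq_len, d_model)
semantic = torch.randn(batch, seq_len, d_model)
positions = torch.arange(seq_len)
hidden = semantic + pos_emb(positions)       # (batch, seq_len, d_model)

with torch.no_grad():
    logits = router(hidden)                  # (batch, seq_len, n_experts)
    probs = F.softmax(logits, dim=-1)
    top_expert = probs.argmax(dim=-1)        # top-1 expert per token

    # Positional probe: for each position, the fraction of the batch that
    # agrees on the most popular expert. A router with no positional
    # structure stays near the chance level of 1 / n_experts.
    per_pos = torch.stack([
        torch.bincount(top_expert[:, t], minlength=n_experts).float() / batch
        for t in range(seq_len)
    ])                                       # (seq_len, n_experts)
    agreement = per_pos.max(dim=-1).values   # agreement rate per position
    print(f"mean agreement across positions: {agreement.mean():.3f} "
          f"(chance = {1.0 / n_experts:.3f})")

With random weights the agreement rate sits near chance; the paper's observation is that trained MoE routers exhibit positional regularities of exactly this kind.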

Cite

Text

Bershatsky and Oseledets. "On the Spatial Structure of Mixture-of-Experts in Transformers." ICLR 2025 Workshops: SLLM, 2025.

Markdown

[Bershatsky and Oseledets. "On the Spatial Structure of Mixture-of-Experts in Transformers." ICLR 2025 Workshops: SLLM, 2025.](https://mlanthology.org/iclrw/2025/bershatsky2025iclrw-spatial/)

BibTeX

@inproceedings{bershatsky2025iclrw-spatial,
  title     = {{On the Spatial Structure of Mixture-of-Experts in Transformers}},
  author    = {Bershatsky, Daniel and Oseledets, Ivan},
  booktitle = {ICLR 2025 Workshops: SLLM},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/bershatsky2025iclrw-spatial/}
}