On the Spatial Structure of Mixture-of-Experts in Transformers
Abstract
A common assumption is that MoE routers select experts primarily on the basis of semantic features. Our study challenges this notion by demonstrating that token position also plays a crucial role in routing decisions. Through extensive empirical analysis, we provide evidence for this claim, develop a phenomenological explanation of the observed behavior, and discuss practical implications for MoE-based architectures.
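To make the abstract's claim concrete, below is a minimal sketch of top-k token routing in a standard softmax-gated MoE layer (Switch/Mixtral-style); all names are illustrative and not taken from the paper's code. The point is that routing depends only on the token's hidden state, so position can influence expert choice only insofar as positional information is mixed into that hidden state.

```python
# Minimal sketch of a top-k MoE router (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def route_tokens(hidden: torch.Tensor, w_gate: torch.Tensor, k: int = 2):
    """hidden: (seq_len, d_model); w_gate: (d_model, n_experts)."""
    logits = hidden @ w_gate                  # per-token expert scores
    probs = F.softmax(logits, dim=-1)
    weights, experts = probs.topk(k, dim=-1)  # k chosen experts per token
    return experts, weights

# If routing were purely semantic, permuting token positions would leave
# each token's expert assignment unchanged; position-dependent routing
# (e.g., via positional encodings baked into `hidden`) would not.
hidden = torch.randn(16, 64)   # 16 tokens, model dim 64
w_gate = torch.randn(64, 8)    # 8 experts
experts, weights = route_tokens(hidden, w_gate)
```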
Cite
Text
Bershatsky and Oseledets. "On the Spatial Structure of Mixture-of-Experts in Transformers." ICLR 2025 Workshops: SLLM, 2025.
Markdown
[Bershatsky and Oseledets. "On the Spatial Structure of Mixture-of-Experts in Transformers." ICLR 2025 Workshops: SLLM, 2025.](https://mlanthology.org/iclrw/2025/bershatsky2025iclrw-spatial/)
BibTeX
@inproceedings{bershatsky2025iclrw-spatial,
title = {{On the Spatial Structure of Mixture-of-Experts in Transformers}},
author = {Bershatsky, Daniel and Oseledets, Ivan},
booktitle = {ICLR 2025 Workshops: SLLM},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/bershatsky2025iclrw-spatial/}
}