Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference
Abstract
Mixture of Experts (MoE) LLMs enhance performance by selectively activating specialized subnetworks ("experts") per input. While MoEs offer efficiency benefits through distributed inference in typical high-throughput settings, deploying them on memory-constrained devices remains challenging, particularly for sequential token generation with batch size one. In this work, we optimize MoE inference for such constrained environments, where only a subset of expert weights fits into DRAM. Through empirical analysis, we show that MoEs can tolerate careful deviations in expert selection with minimal loss in predictive performance. Inspired by this observation, we propose a novel cache-aware routing strategy that leverages expert reuse during token generation to significantly improve cache locality. Evaluated on language modeling, MMLU, and GSM8K benchmarks, our method reduces cache miss rates by over 50%, with negligible impact on perplexity (0.1%–3%) and downstream task accuracy (<0.1%). Unlike prior methods, which are limited by the optimal oracle cache bound, our approach surpasses this theoretical limit by allowing slight flexibility in expert selection. Finally, we present on-device results demonstrating 2$\times$ speedups on mobile hardware, offering a flexible and training-free solution that extends MoE's applicability to real-world applications.
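To make the routing idea concrete, below is a minimal sketch of cache-aware top-k expert selection: the router keeps its nominal ranking, but a non-cached pick is swapped for a cached runner-up whenever the score gap is small. This is an illustration under stated assumptions, not the paper's implementation; the function name `cache_aware_topk`, the additive `tolerance` threshold, and the LRU eviction policy are all hypothetical choices.

```python
import numpy as np
from collections import OrderedDict

def cache_aware_topk(logits, cached, k, tolerance=0.1):
    """Select k experts for one token, preferring experts already in cache.

    A nominal top-k pick that is not cached is swapped for the best cached
    runner-up whose score is within `tolerance` of it. The additive
    threshold is a hypothetical criterion, not the paper's exact rule.
    """
    order = np.argsort(logits)[::-1]      # expert ids, descending score
    top = list(order[:k])                 # nominal top-k routing
    for i, e in enumerate(top):
        if e in cached:
            continue
        for cand in order[k:]:            # runners-up, best first
            if (cand in cached and cand not in top
                    and logits[e] - logits[cand] <= tolerance):
                top[i] = cand             # keep a cached expert instead
                break
    return top

# Toy decode loop: batch-size-one generation with an LRU cache of experts.
rng = np.random.default_rng(0)
num_experts, k, cache_size, steps = 16, 2, 4, 100
cache = OrderedDict((e, None) for e in range(cache_size))  # id -> weights
misses = 0
for _ in range(steps):
    router_logits = rng.normal(size=num_experts)
    for e in cache_aware_topk(router_logits, cache, k):
        if e in cache:
            cache.move_to_end(e)          # refresh LRU recency
        else:
            misses += 1                   # would trigger a DRAM/flash load
            if len(cache) >= cache_size:
                cache.popitem(last=False) # evict least-recently-used expert
            cache[e] = None
print(f"cache miss rate: {misses / (steps * k):.2%}")
```

In this sketch the tolerance trades cache hit rate against routing fidelity: a larger threshold keeps more tokens on already-resident experts at the cost of deviating further from the router's nominal choice, which the abstract reports can be done with negligible impact on perplexity and task accuracy.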
Cite
Text
Skliar et al. "Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference." Transactions on Machine Learning Research, 2025.
Markdown
[Skliar et al. "Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/skliar2025tmlr-mixture/)
BibTeX
@article{skliar2025tmlr-mixture,
title = {{Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference}},
author = {Skliar, Andrii and van Rozendaal, Ties and Lepert, Romain and Boinovski, Todor and Van Baalen, Mart and Nagel, Markus and Whatmough, Paul N. and Bejnordi, Babak Ehteshami},
journal = {Transactions on Machine Learning Research},
year = {2025},
url = {https://mlanthology.org/tmlr/2025/skliar2025tmlr-mixture/}
}