From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs

Abstract

Training large language models (LLMs) for different inference constraints is computationally expensive, limiting control over efficiency-accuracy trade-offs. Moreover, once trained, these models typically process tokens uniformly, regardless of their complexity, leading to static and inflexible behavior. In this paper, we introduce a post-training optimization framework, DynaMoE, that adapts a pre-trained dense LLM into a token-difficulty-driven Mixture-of-Experts model with minimal fine-tuning cost. This adaptation makes the model dynamic, with a sensitivity control to customize the balance between efficiency and accuracy. DynaMoE features a token-difficulty-aware router that predicts the difficulty of each token and directs it to the appropriate sub-network or expert, so that larger experts handle more complex tokens and smaller experts process simpler ones. Our experiments demonstrate that DynaMoE can generate a range of adaptive model variants with a single fine-tuning step, using only 5B tokens, a minimal cost compared to the base model's training. Each variant offers a distinct trade-off between accuracy and efficiency.
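To make the routing idea concrete, the sketch below shows one way a token-difficulty-aware router could dispatch tokens to differently sized feed-forward experts. This is only an illustration of the general mechanism described in the abstract, not the authors' implementation: the two-expert split, the hard top-1 routing rule, and all class and parameter names (e.g. `TokenDifficultyRouter`, `small_mult`, `large_mult`) are assumptions made for the example.

```python
import torch
import torch.nn as nn


class TokenDifficultyRouter(nn.Module):
    """Illustrative router: scores each token and assigns it to an expert.

    Hypothetical sketch; the actual DynaMoE router design is not shown here.
    """

    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.scorer = nn.Linear(d_model, num_experts)  # per-token difficulty logits

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) -> expert index per token, shape (batch, seq)
        return self.scorer(hidden).argmax(dim=-1)      # hard top-1 assignment


class DynamicFFNBlock(nn.Module):
    """Routes 'easy' tokens to a small FFN and 'hard' tokens to a large FFN.

    The two experts and their widths are illustrative assumptions.
    """

    def __init__(self, d_model: int = 512, small_mult: int = 1, large_mult: int = 4):
        super().__init__()
        self.router = TokenDifficultyRouter(d_model, num_experts=2)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, small_mult * d_model), nn.GELU(),
                          nn.Linear(small_mult * d_model, d_model)),   # small expert
            nn.Sequential(nn.Linear(d_model, large_mult * d_model), nn.GELU(),
                          nn.Linear(large_mult * d_model, d_model)),   # large expert
        ])

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        assignment = self.router(hidden)               # (batch, seq)
        out = torch.zeros_like(hidden)
        for idx, expert in enumerate(self.experts):
            mask = assignment == idx                   # tokens routed to this expert
            if mask.any():
                out[mask] = expert(hidden[mask])       # (num_selected, d_model)
        return out


if __name__ == "__main__":
    x = torch.randn(2, 16, 512)                        # (batch, seq, d_model)
    print(DynamicFFNBlock()(x).shape)                  # torch.Size([2, 16, 512])
```

In this sketch the compute saved per token comes from the narrower expert; a sensitivity knob, as described in the abstract, could be realized by shifting how many tokens the router sends to the small expert.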

Cite

Text

Nishu et al. "From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs." ICLR 2025 Workshops: SLLM, 2025.

Markdown

[Nishu et al. "From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs." ICLR 2025 Workshops: SLLM, 2025.](https://mlanthology.org/iclrw/2025/nishu2025iclrw-dense/)

BibTeX

@inproceedings{nishu2025iclrw-dense,
  title     = {{From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs}},
  author    = {Nishu, Kumari and Mehta, Sachin and Abnar, Samira and Farajtabar, Mehrdad and Horton, Maxwell and Najibi, Mahyar and Nabi, Moin and Cho, Minsik and Naik, Devang},
  booktitle = {ICLR 2025 Workshops: SLLM},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/nishu2025iclrw-dense/}
}