Balancing the Experts: Unlocking LoRA-MoE for GRPO via Mechanism-Aware Rewards

Ma, Changlian; Huang, Zizheng; Zeng, Xiangyu; Wang, Yi; Liang, Cheng; Tian, Kun; Zhao, Xinhai; Wang, Limin

ICLR 2026

Abstract

Parameter-efficient Mixture-of-Experts (MoE) architectures, such as LoRA-MoE, enable strong and generalizable fine-tuning. However, a critical problem arises when fine-tuning these architectures with advanced reinforcement learning algorithms such as Group Relative Policy Optimization (GRPO). Traditional supervised techniques are not naturally compatible with the GRPO objective, and naive combinations fail to effectively address routing collapse and the underutilization of MoE adapter parameters. To resolve this disconnect, we introduce Routing-Optimized Group Relative Policy Optimization (RO-GRPO), a mechanism-aware framework. It turns internal expert routing statistics collected during training into a direct reward signal, seamlessly integrating routing supervision into the reinforcement fine-tuning (RFT) process. This enables effective optimization of parameter utilization and improves performance on both unimodal and multimodal mathematical reasoning tasks, all without extra training stages. Our work provides the first demonstration that a scalar reward in GRPO can be engineered from a model's own internal mechanics to explicitly guide its optimization, extending alignment from mere behavior tuning to holistic mechanism alignment.
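The core idea in the abstract, turning expert routing statistics into a scalar reward that GRPO can consume, can be illustrated with a minimal sketch. The abstract does not give the paper's reward formula, so everything below is an assumption made for illustration: the normalized-entropy balance score, the additive combination with a task reward, the `weight` parameter, and all function names are hypothetical and are not the authors' RO-GRPO implementation. The group-relative normalization follows the usual GRPO pattern of standardizing rewards within a group of sampled completions; the population standard deviation is used here as a simplification.

```python
import math
from statistics import mean, pstdev


def routing_balance_reward(expert_counts):
    """Hypothetical balance score: normalized entropy of expert usage.

    Returns a value in [0, 1]; 1.0 means tokens were spread uniformly
    across experts, 0.0 means a single expert received every token.
    """
    total = sum(expert_counts)
    num_experts = len(expert_counts)
    if total == 0 or num_experts < 2:
        return 0.0
    probs = [c / total for c in expert_counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    # Clamp to avoid returning -0.0 when one expert takes all tokens.
    return max(0.0, entropy / math.log(num_experts))


def combined_rewards(task_rewards, routing_counts, weight=0.1):
    """Assumed combination: task reward plus a weighted routing bonus."""
    return [
        task + weight * routing_balance_reward(counts)
        for task, counts in zip(task_rewards, routing_counts)
    ]


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std, a simplifying assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two completions with the same task reward but different routing behavior.
task_rewards = [1.0, 1.0]
routing_counts = [[5, 5, 5, 5], [20, 0, 0, 0]]  # tokens per expert
rewards = combined_rewards(task_rewards, routing_counts)
advantages = group_relative_advantages(rewards)
```

In this toy group, both completions are equally correct, so a task-only reward would give them zero advantage. With the assumed routing bonus, the completion that used all four experts evenly receives a positive advantage and the collapsed one a negative advantage, which is the kind of pressure against routing collapse that the abstract describes.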

Cite

Text

Ma et al. "Balancing the Experts: Unlocking LoRA-MoE for GRPO via Mechanism-Aware Rewards." International Conference on Learning Representations, 2026.

Markdown

[Ma et al. "Balancing the Experts: Unlocking LoRA-MoE for GRPO via Mechanism-Aware Rewards." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/ma2026iclr-balancing/)

BibTeX

@inproceedings{ma2026iclr-balancing,
  title     = {{Balancing the Experts: Unlocking LoRA-MoE for GRPO via Mechanism-Aware Rewards}},
  author    = {Ma, Changlian and Huang, Zizheng and Zeng, Xiangyu and Wang, Yi and Liang, Cheng and Tian, Kun and Zhao, Xinhai and Wang, Limin},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/ma2026iclr-balancing/}
}