M$^3$E: Continual Vision-and-Language Navigation via Mixture of Macro and Micro Experts

Abstract

Vision-and-Language Navigation (VLN) agents have shown strong capabilities in following natural language instructions. However, they often struggle to generalize across environments due to catastrophic forgetting, which limits their practical use in real-world settings where agents must continually adapt to new domains. We argue that overcoming forgetting across environments hinges on decoupling global scene reasoning from local perceptual alignment, allowing the agent to adapt to new domains while preserving specialized capabilities. To this end, we propose M$^3$E, the Mixture of Macro and Micro Experts, an environment-aware hierarchical MoE framework for continual VLN. Our method introduces a dual-router architecture that separates navigation into two levels of reasoning. A macro-level, scene-aware router selects strategy experts based on global environmental features (e.g., office vs. residential), while a micro-level, instance-aware router activates perception experts based on local instruction-vision alignment for step-wise decision making. To preserve knowledge across domains, we adopt a dynamic momentum update strategy that identifies expert utility in new environments and selectively updates or freezes their parameters. We evaluate M$^3$E in a domain-incremental setting on the R2R and REVERIE datasets, where agents learn across unseen scenes without revisiting prior data. Results show that our method consistently outperforms standard fine-tuning and existing continual learning baselines in both adaptability and knowledge retention, offering a parameter-efficient solution for building generalizable embodied agents.
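
The dual-router idea and the selective update described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the router form, the number of experts, or the update rule, so everything below is an assumption. The function names (`route`, `momentum_update`), the linear softmax routers, the top-k gating, and the `freeze_below` threshold are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def route(features, router_weights, top_k=1):
    """Assumed linear router: score each expert, keep the top_k,
    and renormalize their gates so they sum to 1."""
    probs = softmax([dot(w, features) for w in router_weights])
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def momentum_update(old_params, new_params, utility, freeze_below=0.1):
    """Hypothetical selective update: an expert with low utility in the new
    environment is frozen; otherwise its parameters are blended toward the
    newly trained values, with the utility acting as the blend rate."""
    if utility < freeze_below:
        return list(old_params)  # frozen: prior knowledge kept unchanged
    return [(1 - utility) * o + utility * n for o, n in zip(old_params, new_params)]

# Macro level: one routing decision from global scene features.
macro_router = [[1.0, 0.0], [0.0, 1.0]]  # 2 hypothetical strategy experts
strategy = route([2.0, 0.0], macro_router, top_k=1)

# Micro level: one routing decision per step from instruction-vision features.
micro_router = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
perception = route([0.0, 0.0, 3.0], micro_router, top_k=2)
```

In this sketch the macro router is evaluated on scene-level features while the micro router is evaluated on per-step features, which mirrors the separation of global scene reasoning from local perceptual alignment. How utility is actually measured and how the momentum coefficient is set are left to the paper.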

Cite

Text

Jiang et al. "M$^3$E: Continual Vision-and-Language Navigation via Mixture of Macro and Micro Experts." International Conference on Learning Representations, 2026.

Markdown

[Jiang et al. "M$^3$E: Continual Vision-and-Language Navigation via Mixture of Macro and Micro Experts." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/jiang2026iclr-3e/)

BibTeX

@inproceedings{jiang2026iclr-3e,
  title     = {{M$^3$E: Continual Vision-and-Language Navigation via Mixture of Macro and Micro Experts}},
  author    = {Jiang, Yongliang and Zhang, Huaidong and Luo, Xuandi and He, Shengfeng},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/jiang2026iclr-3e/}
}