Modality-Inconsistent Continual Learning of Multimodal Large Language Models

Abstract

In this paper, we introduce Modality-Inconsistent Continual Learning (MICL), a new continual learning scenario for Multimodal Large Language Models (MLLMs) that involves tasks with inconsistent modalities (image, audio, or video) and varying task types (captioning or question-answering). Unlike existing vision-only or modality-incremental settings, MICL combines modality shifts and task type shifts, both of which drive catastrophic forgetting. To address these challenges, we propose MoInCL, which employs a Pseudo Targets Generation Module to mitigate forgetting caused by task type shifts in previously seen modalities. It also incorporates Instruction-based Knowledge Distillation to preserve the model's ability to handle previously learned modalities when new ones are introduced. We benchmark MICL on six tasks and conduct experiments to validate the effectiveness of MoInCL. The results show that MoInCL significantly outperforms representative and state-of-the-art continual learning baselines.

Cite

Text

Pian et al. "Modality-Inconsistent Continual Learning of Multimodal Large Language Models." Transactions on Machine Learning Research, 2026.

Markdown

[Pian et al. "Modality-Inconsistent Continual Learning of Multimodal Large Language Models." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/pian2026tmlr-modalityinconsistent/)

BibTeX

@article{pian2026tmlr-modalityinconsistent,
  title     = {{Modality-Inconsistent Continual Learning of Multimodal Large Language Models}},
  author    = {Pian, Weiguo and Deng, Shijian and Mo, Shentong and Liu, Mingrui and Guo, Yunhui and Tian, Yapeng},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/pian2026tmlr-modalityinconsistent/}
}