How to Merge Your Multimodal Models over Time?

Abstract

Model merging combines expert models, each finetuned from a shared foundation model on diverse tasks and domains, into a single, more capable base model. However, existing model merging approaches assume that all experts are available simultaneously. In reality, new tasks and domains emerge continuously, prompting the need for a dynamic process of integrating these experts over time, which we call temporal model merging. The temporal dimension introduces unique challenges not addressed in prior work: For each new task, should expert training start from the merged previous experts or from the original base model? Should all models be merged at every time step? Which merging techniques are best suited for temporal merging? Should different strategies be used for the training initialization and deployment phases? To tackle these questions, we propose a unified framework called TIME (Temporal Integration of Model Expertise) that defines temporal model merging across three axes: (1) Initialization Phase, (2) Deployment Phase, and (3) Merging Technique. Using TIME, we study temporal model merging across model sizes, tasks, and compute budgets on the large-scale FoMo-in-Flux benchmark for continual multimodal pretraining. Systematic experiments across TIME and FoMo-in-Flux yield several key insights into temporal model merging, clarifying current limits and best practices for merging models over time.
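To make the three axes concrete, the following is a minimal sketch of a temporal merging loop. It is not the authors' implementation: the names `temporal_merge`, `init_from_merged`, and `deploy_merge_all` are hypothetical, models are represented as toy dictionaries of floats, and uniform weight averaging stands in for whichever merging technique is under study.

```python
from typing import Callable, Dict, List

Weights = Dict[str, float]  # toy stand-in for a model's parameters


def average(models: List[Weights]) -> Weights:
    """Uniform weight averaging, used here as one simple merging technique."""
    keys = models[0].keys()
    return {k: sum(m[k] for m in models) / len(models) for k in keys}


def temporal_merge(
    base: Weights,
    tasks: List[float],
    train: Callable[[Weights, float], Weights],
    merge: Callable[[List[Weights]], Weights] = average,
    init_from_merged: bool = True,   # assumed knob for the initialization axis
    deploy_merge_all: bool = True,   # assumed knob for the deployment axis
) -> List[Weights]:
    """Return the deployed model after each task arrives."""
    experts: List[Weights] = []
    deployed: List[Weights] = []
    current = base
    for task in tasks:
        # Initialization: start from the latest merged model or from the base.
        start = current if init_from_merged else base
        expert = train(start, task)
        experts.append(expert)
        # Deployment: merge every expert so far, or only the previous
        # deployed model with the newest expert.
        pool = experts if deploy_merge_all else [current, expert]
        current = merge(pool)
        deployed.append(current)
    return deployed


def toy_train(start: Weights, task: float) -> Weights:
    """Hypothetical 'finetuning': shift the single weight by the task value."""
    return {"w": start["w"] + task}


if __name__ == "__main__":
    for flag in (True, False):
        out = temporal_merge({"w": 0.0}, [1.0, 3.0], toy_train, init_from_merged=flag)
        print(flag, out)
```

Switching `init_from_merged` or `deploy_merge_all`, or passing a different `merge` function, changes one axis at a time while holding the others fixed, which is the kind of controlled comparison the framework is meant to support.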

Cite

Text

Dziadzio et al. "How to Merge Your Multimodal Models over Time?" Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01907

Markdown

[Dziadzio et al. "How to Merge Your Multimodal Models over Time?" Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/dziadzio2025cvpr-merge/) doi:10.1109/CVPR52734.2025.01907

BibTeX

@inproceedings{dziadzio2025cvpr-merge,
  title     = {{How to Merge Your Multimodal Models over Time?}},
  author    = {Dziadzio, Sebastian and Udandarao, Vishaal and Roth, Karsten and Prabhu, Ameya and Akata, Zeynep and Albanie, Samuel and Bethge, Matthias},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {20479--20491},
  doi       = {10.1109/CVPR52734.2025.01907},
  url       = {https://mlanthology.org/cvpr/2025/dziadzio2025cvpr-merge/}
}