REMEDY: Recipe Merging Dynamics in Large Vision-Language Models

Abstract

Model merging has emerged as a powerful technique for combining task-specific vision models into a unified, multi-functional model. Previous methods, represented by task arithmetic, have demonstrated effectiveness and scalability in this domain. As large vision-language models (LVLMs) emerge with ever-growing model sizes, however, this design struggles to fuse different instruction-tuned LVLMs for generalization enhancement. The large scale and multi-modal nature of LVLMs present unique obstacles, including constructing reusable and modular components to accommodate the multi-component architecture of LVLMs and the requirement for dynamic fusion based on multi-modal input tokens. To address these challenges, we propose the **RE**cipe **ME**rging **DY**namics (REMEDY) method, a scalable and flexible paradigm for model merging in LVLMs. We first define reusable modules termed *recipes*, including the projector and shallow LLM layers, which enhance visual-language understanding. Then, we introduce a modality-aware allocator that dynamically generates weights in a one-shot manner based on the input's relevance to existing recipes, enabling efficient cross-modal knowledge integration. REMEDY thus offers an adaptive solution for LVLMs to tackle both seen (i.e., multi-task learning) and unseen (i.e., zero-shot generalization) tasks. Experimental results demonstrate that our method consistently improves performance on both seen and unseen tasks, underscoring the effectiveness of REMEDY in diverse multi-modal scenarios.
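The abstract builds on task arithmetic (merging by adding weighted task vectors to a base model) and adds an input-conditioned weight allocator. The toy sketch below illustrates that general recipe, not the paper's actual implementation: parameters are flat lists, each "recipe" is a task vector, and the allocator is a simple softmax over assumed input-to-recipe similarity scores.

```python
import math

def task_vector(finetuned, base):
    """Task vector: element-wise difference of fine-tuned and base parameters."""
    return [f - b for f, b in zip(finetuned, base)]

def allocate_weights(similarities):
    """Toy allocator: softmax over the input's similarity to each recipe.
    (The paper's modality-aware allocator is learned; this stands in for it.)"""
    m = max(similarities)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in similarities]
    total = sum(exps)
    return [e / total for e in exps]

def merge(base, task_vectors, weights):
    """Task arithmetic: merged = base + sum_i weight_i * task_vector_i."""
    merged = list(base)
    for w, tv in zip(weights, task_vectors):
        merged = [m + w * t for m, t in zip(merged, tv)]
    return merged

# Two hypothetical recipes fine-tuned from a shared base.
base = [0.0, 0.0]
recipe_a = task_vector([1.0, 0.0], base)
recipe_b = task_vector([0.0, 2.0], base)

# Weights are generated per input, so different inputs yield different merges.
weights = allocate_weights([0.0, 0.0])  # equal relevance -> equal weights
merged = merge(base, [recipe_a, recipe_b], weights)
```

Because the weights depend on the input's relevance scores, the same set of recipes can be fused differently for each multi-modal input, which is the dynamic behavior the abstract describes.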

Cite

Text

Zhu et al. "REMEDY: Recipe Merging Dynamics in Large Vision-Language Models." International Conference on Learning Representations, 2025.

Markdown

[Zhu et al. "REMEDY: Recipe Merging Dynamics in Large Vision-Language Models." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/zhu2025iclr-remedy/)

BibTeX

@inproceedings{zhu2025iclr-remedy,
  title     = {{REMEDY: Recipe Merging Dynamics in Large Vision-Language Models}},
  author    = {Zhu, Didi and Song, Yibing and Shen, Tao and Zhao, Ziyu and Yang, Jinluan and Zhang, Min and Wu, Chao},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/zhu2025iclr-remedy/}
}