Exploring the Better Multimodal Synergy Strategy for Vision-Language Models
Abstract
Vision-Language models (VLMs) have shown great potential in enhancing open-world visual concept comprehension. Recent research has focused on finding an optimal multimodal collaboration strategy that significantly advances CLIP-based few-shot tasks. However, existing prompt-based solutions suffer from unidirectional information flow and increased parameter counts, since they explicitly condition the vision prompts on textual prompts across different transformer layers using non-shareable coupling functions. To address this issue, we propose DsRA, a dual-shared mechanism based on LoRA, for VLM adaptation in low-data regimes. The proposed DsRA enjoys several merits. First, we design an inter-modal shared coefficient that captures visual and textual shared patterns, ensuring effective mutual synergy between image and text features. Second, an intra-modal shared matrix is proposed to achieve parameter-efficient fine-tuning by combining the different coefficients to generate layer-wise adapters placed in the encoder layers. Our extensive experiments demonstrate that DsRA improves generalizability under few-shot classification, base-to-new generalization, and domain generalization settings. Our code will be released soon.
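The abstract does not spell out the exact factorization, so the following is only a minimal sketch of the general idea it describes: low-rank matrices shared across layers within a modality, combined with small per-layer coefficient vectors shared between the vision and text branches to produce layer-wise adapters. All names, shapes, and the diagonal-coefficient composition below are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, num_layers = 64, 8, 4  # hidden size, LoRA rank, encoder depth (illustrative)

# Intra-modal shared matrices: one low-rank pair reused by every layer
# of a modality, instead of a fresh A/B pair per layer (assumed design).
A = rng.standard_normal((r, d)) * 0.01  # shared down-projection
B = rng.standard_normal((d, r)) * 0.01  # shared up-projection

# Inter-modal shared coefficients: one small vector per layer, reused by
# BOTH the vision and text branches so the two encoders adapt jointly.
coeffs = [rng.standard_normal(r) for _ in range(num_layers)]

def layer_adapter(layer_idx):
    """Compose a layer-wise low-rank update: Delta_W = B @ diag(c_l) @ A."""
    return B @ np.diag(coeffs[layer_idx]) @ A

# Adapted forward pass for one layer: frozen weight plus the generated adapter.
W = rng.standard_normal((d, d))  # stand-in for a frozen pretrained weight
x = rng.standard_normal(d)
h = W @ x + layer_adapter(0) @ x
print(h.shape)  # (64,)
```

Sharing A and B across layers keeps the trainable parameter count near 2·d·r plus a handful of r-dimensional coefficient vectors, rather than 2·d·r per layer.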
Cite
Text
Yin et al. "Exploring the Better Multimodal Synergy Strategy for Vision-Language Models." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I21.34372
Markdown
[Yin et al. "Exploring the Better Multimodal Synergy Strategy for Vision-Language Models." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/yin2025aaai-exploring-a/) doi:10.1609/AAAI.V39I21.34372
BibTeX
@inproceedings{yin2025aaai-exploring-a,
title = {{Exploring the Better Multimodal Synergy Strategy for Vision-Language Models}},
author = {Yin, Xiaotian and Liu, Xin and Chen, Si and Wang, Yuan and Pan, Yuwen and Zhang, Tianzhu},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {22182-22190},
doi = {10.1609/AAAI.V39I21.34372},
url = {https://mlanthology.org/aaai/2025/yin2025aaai-exploring-a/}
}