The Inter-Intra Modal Measure: A Predictive Lens on Fine-Tuning Outcomes in Vision-Language Models

Abstract

The fine-tuning of large vision-language foundation models remains underexplored, particularly with regard to its impact on learning gains and catastrophic forgetting. Inspired by the significance of modality gaps in contrastive dual-encoders, we introduce the Inter-Intra Modal Measure (IIMM), a predictive metric that quantifies the relationship between intra-modal image embedding similarity and inter-modal misalignment. Through extensive empirical analysis across four state-of-the-art vision-language models and five fine-tuning techniques, we establish a strong linear relationship: tasks with higher IIMM scores yield greater in-domain performance improvements but suffer more pronounced out-of-domain degradation, with some parameter-efficient fine-tuning (PEFT) methods exhibiting severe forgetting. Compared to existing transferability measures, the IIMM demonstrates significantly stronger predictive power for accuracy changes after fine-tuning in dual-encoder models. Moreover, we provide a theoretical bound, proving that changes in IIMM are limited by the Wasserstein distance between pre- and post-fine-tuning embedding distributions, which supports its stability and robustness as a predictive measure. With only a single forward pass over the target data, practitioners can use the IIMM to estimate how much a model can be expected to improve following fine-tuning. When combined with prior knowledge of a model's performance across diverse tasks, the IIMM further enhances transferability predictions for novel tasks, offering a lightweight yet effective tool for guiding model adaptation strategies. Our code is provided at https://github.com/mit-ll/IIMM.
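The abstract does not state the IIMM formula, so the following is only a minimal sketch of the two ingredients it names: intra-modal image embedding similarity and inter-modal (image-to-text) alignment. It assumes the image and class-text embeddings have already been extracted from a dual-encoder in a single forward pass; the function name `intra_inter_similarity`, its arguments, and the choice of mean cosine similarity are illustrative assumptions, not the authors' definition. How the two quantities are combined into the IIMM is left to the paper and the linked repository.

```python
import numpy as np


def l2_normalize(x):
    # Scale each row to unit length so dot products equal cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def intra_inter_similarity(image_emb, text_emb, labels):
    """Hypothetical helper (not the authors' implementation).

    image_emb: (n, d) image embeddings, n >= 2
    text_emb:  (c, d) one text embedding per class
    labels:    (n,) integer class index of each image
    Returns (mean pairwise image-image cosine similarity,
             mean cosine similarity between each image and its class text).
    """
    img = l2_normalize(np.asarray(image_emb, dtype=np.float64))
    txt = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    labels = np.asarray(labels, dtype=np.int64)
    n = img.shape[0]

    # Intra-modal: average cosine similarity over all distinct image pairs
    # (the diagonal of self-similarities is excluded).
    sim = img @ img.T
    intra = (sim.sum() - np.trace(sim)) / (n * (n - 1))

    # Inter-modal: average cosine similarity between each image and the
    # text embedding of its own class; low values indicate misalignment.
    inter = np.mean(np.sum(img * txt[labels], axis=1))
    return float(intra), float(inter)
```

Under these assumptions, a task whose images are highly similar to one another but poorly aligned with their class text would show a high first value and a low second value.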

Cite

Text

Niss et al. "The Inter-Intra Modal Measure: A Predictive Lens on Fine-Tuning Outcomes in Vision-Language Models." International Conference on Computer Vision, 2025.

Markdown

[Niss et al. "The Inter-Intra Modal Measure: A Predictive Lens on Fine-Tuning Outcomes in Vision-Language Models." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/niss2025iccv-interintra/)

BibTeX

@inproceedings{niss2025iccv-interintra,
  title     = {{The Inter-Intra Modal Measure: A Predictive Lens on Fine-Tuning Outcomes in Vision-Language Models}},
  author    = {Niss, Laura and Vogt-Lowell, Kevin and Tsiligkaridis, Theodoros},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {2396--2406},
  url       = {https://mlanthology.org/iccv/2025/niss2025iccv-interintra/}
}