Knowledge Transfer from Interaction Learning

Abstract

Current visual foundation models (VFMs) face a fundamental limitation in transferring knowledge from vision language models (VLMs): while VLMs excel at modeling cross-modal interactions through unified representation spaces, existing VFMs predominantly adopt result-oriented paradigms that neglect the underlying interaction processes. This representational discrepancy leads to suboptimal knowledge transfer and limited generalization capabilities across vision tasks. We propose Learning from Interactions, a cognitive-inspired framework that bridges this gap by explicitly modeling interactions during visual understanding. Our key insight is that preserving the interaction dynamics captured by VLMs, rather than just their final representations, enables more effective knowledge transfer to downstream VFMs. The technical core involves two innovations: (1) Interaction Queries that maintain persistent relationships across network layers, and (2) interaction-based supervision derived from pre-trained VLMs' cross-modal attention patterns. Comprehensive experiments demonstrate consistent improvements across multiple benchmarks: achieving ~3.3% and +1.6 mAP/+2.4 AP^mask absolute gains on TinyImageNet classification and COCO detection/segmentation respectively, with minimal parameter overhead and faster convergence (7x speedup). The framework particularly excels in cross-domain scenarios, delivering ~2.4% and ~9.3% zero-shot improvements on PACS and VLCS. Human evaluations confirm our approach's cognitive alignment, outperforming result-oriented methods by 2.7x in semantic consistency metrics.
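As a rough illustration of what "interaction-based supervision derived from pre-trained VLMs' cross-modal attention patterns" could look like, the sketch below compares a teacher attention map with a student attention map using a row-wise KL divergence. The abstract does not specify the loss, so the choice of KL divergence, the function names `softmax` and `attention_kl`, and the toy inputs are all assumptions made for illustration, not the authors' implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of attention logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_kl(teacher_logits, student_logits, eps=1e-12):
    """Mean KL(teacher || student) over query rows.

    Hypothetical stand-in for an attention-matching loss: each row holds
    the attention logits of one query over the same set of keys.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must have the same number of rows")
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row)  # teacher attention distribution
        q = softmax(s_row)  # student attention distribution
        total += sum(pi * math.log((pi + eps) / (qi + eps))
                     for pi, qi in zip(p, q))
    return total / len(teacher_logits)

# Toy example: one query attending over two keys.
teacher = [[2.0, 0.0]]
student = [[0.0, 2.0]]
loss = attention_kl(teacher, student)
```

The loss is zero when the student reproduces the teacher's attention pattern and grows as the two patterns diverge, which is the property any such supervision signal would need regardless of the exact formulation the paper uses.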

Cite

Text

Gao et al. "Knowledge Transfer from Interaction Learning." International Conference on Computer Vision, 2025.

Markdown

[Gao et al. "Knowledge Transfer from Interaction Learning." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/gao2025iccv-knowledge/)

BibTeX

@inproceedings{gao2025iccv-knowledge,
  title     = {{Knowledge Transfer from Interaction Learning}},
  author    = {Gao, Yilin and Chen, Kangyi and Peng, Zhongxing and Lu, Hengjie and Xu, Shugong},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {3585-3595},
  url       = {https://mlanthology.org/iccv/2025/gao2025iccv-knowledge/}
}