Domain Adaptation of VLM for Soccer Video Understanding

Abstract

Vision Language Models (VLMs) have demonstrated strong performance in multi-modal tasks by effectively aligning visual and textual representations. However, most video understanding VLM research has been domain-agnostic, leaving their transfer learning capability to specialized domains underexplored. In this work, we address this by exploring the adaptability of open-source VLMs to specific domains, focusing on soccer as an initial case study. Our approach uses large-scale soccer datasets and an LLM to create instruction-following data, which we use to iteratively fine-tune a general-domain VLM in a curriculum learning fashion (first teaching the model key soccer concepts, then question-answering tasks). The final adapted model, trained on a curated dataset of 20k video clips, exhibits significant improvement on soccer-specific tasks compared to the base model, with a 37.5% relative improvement on the visual question-answering task and an accuracy improvement from 11.8% to 63.5% on the downstream soccer action classification task.
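The two-stage curriculum described above (concept teaching before question answering) can be sketched as a data-ordering step around an ordinary fine-tuning loop. The sketch below is illustrative only: the `Sample` fields, task labels, and `fine_tune` stand-in are assumptions for exposition, not the paper's actual pipeline or API.

```python
# Hypothetical sketch of curriculum-ordered instruction tuning: samples are
# grouped into stages (domain concepts first, then QA) and the model is
# fine-tuned stage by stage. All names here are illustrative assumptions.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    clip_id: str
    task: str    # "concept" (e.g., rule/tactic definitions) or "qa"
    prompt: str
    answer: str


def build_curriculum(samples: List[Sample]) -> List[List[Sample]]:
    """Order instruction data into stages: concept teaching first, then QA."""
    stage_order = ["concept", "qa"]
    return [[s for s in samples if s.task == t] for t in stage_order]


def fine_tune(model_state: Dict, stage: List[Sample]) -> Dict:
    """Stand-in for one fine-tuning pass; real training would update weights.

    Here we only record which task types the model has been exposed to,
    to make the staged ordering visible."""
    seen = list(model_state.get("seen_tasks", []))
    for s in stage:
        if s.task not in seen:
            seen.append(s.task)
    return {**model_state, "seen_tasks": seen}


if __name__ == "__main__":
    data = [
        Sample("clip_001", "qa", "Which team scored?", "The home team."),
        Sample("clip_002", "concept", "What is an offside trap?",
               "A coordinated defensive line tactic."),
    ]
    state = {"base": "general-domain VLM"}
    for stage in build_curriculum(data):
        state = fine_tune(state, stage)
    print(state["seen_tasks"])  # concept stage is applied before qa
```

The only point the sketch makes is the ordering: the same loop works for any number of stages, and swapping `stage_order` changes the curriculum without touching the training code.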

Cite

Text

Jiang et al. "Domain Adaptation of VLM for Soccer Video Understanding." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.

Markdown

[Jiang et al. "Domain Adaptation of VLM for Soccer Video Understanding." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/jiang2025cvprw-domain/)

BibTeX

@inproceedings{jiang2025cvprw-domain,
  title     = {{Domain Adaptation of VLM for Soccer Video Understanding}},
  author    = {Jiang, Tiancheng and Wang, Henry and Salekin, Md Sirajus and Atighehchian, Parmida and Zhang, Shinan},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2025},
  pages     = {6111--6121},
  url       = {https://mlanthology.org/cvprw/2025/jiang2025cvprw-domain/}
}