SF2T: Self-Supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding

Abstract

Video-based Large Language Models (Video-LLMs) have advanced substantially in recent years, propelled by progress in multi-modal LLMs. Although these models are proficient at describing videos as a whole, they struggle with fine-grained understanding, particularly with respect to visual dynamics and queries about video details. To address these shortcomings, we find that fine-tuning Video-LLMs on self-supervised fragment tasks greatly improves their fine-grained video understanding. Hence we make two key contributions: (1) Self-Supervised Fragment Fine-Tuning (SF^2T), a novel, low-effort fine-tuning method that exploits the rich inherent characteristics of videos for training and unlocks finer-grained understanding in Video-LLMs. Moreover, it relieves researchers of labor-intensive annotation and circumvents the limitations of natural language, which often fails to capture the complex spatiotemporal variations in videos; (2) FineVidBench, a novel benchmark dataset for rigorously assessing Video-LLM performance at both the scene and fragment levels, offering a comprehensive evaluation of their capabilities. We assessed multiple models and validated the effectiveness of SF^2T on them. Experimental results reveal that our approach improves their ability to capture and interpret spatiotemporal details.
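To make the idea of a self-supervised fragment task concrete, below is a minimal, hypothetical sketch in Python of one such task: temporal-order prediction over contiguous video fragments, where the label is derived from the video itself rather than human annotation. The fragment count, labeling scheme, and function names are illustrative assumptions; the paper's actual fragment task designs and training pipeline are not reproduced here.

```python
import random
from itertools import permutations

def make_order_prediction_sample(frames, num_fragments=4, seed=None):
    """Build one self-supervised training pair from a list of frames.

    Splits the frame sequence into contiguous fragments, shuffles them,
    and returns (shuffled_fragments, permutation_index). A Video-LLM can
    then be fine-tuned to recover the original temporal order -- the label
    comes from the video itself, so no human annotation is needed.
    (Illustrative sketch only; the task design here is an assumption.)
    """
    rng = random.Random(seed)
    # Split frames into num_fragments contiguous, roughly equal chunks.
    n = len(frames)
    bounds = [round(i * n / num_fragments) for i in range(num_fragments + 1)]
    fragments = [frames[bounds[i]:bounds[i + 1]] for i in range(num_fragments)]
    # Pick a random permutation; its index serves as the class label.
    perms = list(permutations(range(num_fragments)))
    label = rng.randrange(len(perms))
    shuffled = [fragments[i] for i in perms[label]]
    return shuffled, label

# Usage: strings stand in for decoded video frames (e.g. image tensors).
frames = [f"frame_{i}" for i in range(16)]
shuffled, label = make_order_prediction_sample(frames, num_fragments=4, seed=0)
print(label, [frag[0] for frag in shuffled])
```

Because such targets are generated programmatically, fine-tuning data of this kind can be produced at scale from unlabeled video, which is the property the abstract highlights.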

Cite

Text

Hu et al. "SF2T: Self-Supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02710

Markdown

[Hu et al. "SF2T: Self-Supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/hu2025cvpr-sf2t/) doi:10.1109/CVPR52734.2025.02710

BibTeX

@inproceedings{hu2025cvpr-sf2t,
  title     = {{SF2T: Self-Supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding}},
  author    = {Hu, Yangliu and Song, Zikai and Feng, Na and Luo, Yawei and Yu, Junqing and Chen, Yi-Ping Phoebe and Yang, Wei},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {29108--29117},
  doi       = {10.1109/CVPR52734.2025.02710},
  url       = {https://mlanthology.org/cvpr/2025/hu2025cvpr-sf2t/}
}