SF2T: Self-Supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding
Abstract
Video-based Large Language Models (Video-LLMs) have advanced substantially in recent years, propelled by progress in multi-modal LLMs. Although these models are proficient at providing overall descriptions of videos, they struggle with fine-grained understanding, particularly with visual dynamics and detailed video queries. To address these shortcomings, we find that fine-tuning Video-LLMs on self-supervised fragment tasks greatly improves their fine-grained video understanding. Hence we make two key contributions: (1) Self-Supervised Fragment Fine-Tuning (SF^2T), a novel, effortless fine-tuning method that exploits the rich inherent characteristics of videos for training while unlocking finer-grained understanding in Video-LLMs. It also relieves researchers of labor-intensive annotation and neatly sidesteps the limitations of natural language, which often fails to capture the complex spatiotemporal variations in videos. (2) FineVidBench, a novel benchmark dataset for rigorously assessing Video-LLM performance at both the scene and fragment levels, offering a comprehensive evaluation of their capabilities. We assessed multiple models and validated the effectiveness of SF^2T on them. Experimental results show that our approach improves their ability to capture and interpret spatiotemporal details.
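To make the idea of a self-supervised fragment task concrete, the sketch below builds one training sample from a video's own temporal structure (here, fragment ordering), so the supervision label requires no human annotation. This is an illustration only, not the paper's actual task design: the function names, prompt wording, and data format are all hypothetical.

import random

def sample_fragments(num_frames: int, num_fragments: int = 3, frag_len: int = 8):
    # Pick non-overlapping frame ranges ("fragments") on a frag_len-strided grid.
    candidates = list(range(0, num_frames - frag_len + 1, frag_len))
    starts = sorted(random.sample(candidates, num_fragments))
    return [(s, s + frag_len) for s in starts]

def make_ordering_example(video_path: str, num_frames: int):
    # Build one instruction-tuning sample whose label comes from the video
    # itself: the model must recover the true temporal order of shuffled clips.
    fragments = sample_fragments(num_frames)
    order = list(range(len(fragments)))
    random.shuffle(order)  # order[j] = original index of the clip shown at position j
    # Inverse permutation: for each original clip i, its presented position.
    answer = sorted(range(len(order)), key=order.__getitem__)
    return {
        "video": video_path,
        "fragments": [fragments[j] for j in order],
        "prompt": "The clips are shuffled. Give their original temporal order.",
        "label": " ".join(str(p + 1) for p in answer),  # 1-based positions
    }

print(make_ordering_example("demo.mp4", num_frames=120))

Other fragment tasks in the same spirit (e.g., predicting playback speed or locating a fragment within the full video) can be generated the same way, since the ground truth is always known from how the sample was constructed.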
Cite
Text
Hu et al. "SF2T: Self-Supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02710
Markdown
[Hu et al. "SF2T: Self-Supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/hu2025cvpr-sf2t/) doi:10.1109/CVPR52734.2025.02710
BibTeX
@inproceedings{hu2025cvpr-sf2t,
title = {{SF2T: Self-Supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding}},
author = {Hu, Yangliu and Song, Zikai and Feng, Na and Luo, Yawei and Yu, Junqing and Chen, Yi-Ping Phoebe and Yang, Wei},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {29108--29117},
doi = {10.1109/CVPR52734.2025.02710},
url = {https://mlanthology.org/cvpr/2025/hu2025cvpr-sf2t/}
}