Progressive Video Summarization via Multimodal Self-Supervised Learning

Abstract

Modern video summarization methods are based on deep neural networks that require a large amount of annotated data for training. However, existing datasets for video summarization are small-scale, which easily leads to over-fitting of deep models. Since annotating large-scale datasets is time-consuming, we propose a multimodal self-supervised learning framework that obtains semantic representations of videos which benefit the video summarization task. Specifically, self-supervised learning is conducted by exploring the semantic consistency between videos and their paired text in both coarse-grained and fine-grained fashions, as well as by recovering masked frames in the videos. The multimodal framework is trained on a newly collected dataset consisting of video-text pairs. In addition, we introduce a progressive video summarization method in which the important content of a video is pinpointed progressively to generate better summaries. Extensive experiments demonstrate the effectiveness and superiority of our method over the state of the art in terms of rank correlation coefficients and F-score.
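
To make the training signal concrete, below is a minimal PyTorch-style sketch of two of the self-supervised objectives described above: coarse-grained video-text consistency (here realized as a symmetric InfoNCE contrastive loss over a batch) and masked-frame recovery. All module names, feature dimensions, the temperature value, and the reconstruction target are illustrative assumptions, not the authors' implementation; the fine-grained (frame-level) consistency term is omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalSSLSketch(nn.Module):
    """Hypothetical sketch of two self-supervised objectives:
    coarse-grained video-text contrast and masked-frame recovery.
    Architecture choices here are assumptions for illustration only."""

    def __init__(self, dim=512, n_heads=8, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.video_encoder = nn.TransformerEncoder(layer, n_layers)
        # Learnable placeholder substituted for masked frame features.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Head that predicts the original features at masked positions.
        self.recover_head = nn.Linear(dim, dim)

    def forward(self, frames, text_emb, mask):
        # frames:   (B, T, dim) precomputed frame features
        # text_emb: (B, dim)    sentence embedding of the paired text
        # mask:     (B, T) bool, True where a frame is masked out
        x = torch.where(mask.unsqueeze(-1),
                        self.mask_token.expand_as(frames), frames)
        h = self.video_encoder(x)

        # Coarse-grained consistency: align the pooled video representation
        # with its paired text via symmetric InfoNCE over the batch.
        v = F.normalize(h.mean(dim=1), dim=-1)
        t = F.normalize(text_emb, dim=-1)
        logits = v @ t.T / 0.07  # temperature is an assumed value
        labels = torch.arange(len(v), device=v.device)
        loss_coarse = (F.cross_entropy(logits, labels) +
                       F.cross_entropy(logits.T, labels)) / 2

        # Masked-frame recovery: regress the original frame features
        # at the masked positions.
        loss_recover = F.mse_loss(self.recover_head(h)[mask], frames[mask])
        return loss_coarse + loss_recover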

Cite

Text

Li et al. "Progressive Video Summarization via Multimodal Self-Supervised Learning." Winter Conference on Applications of Computer Vision, 2023.

Markdown

[Li et al. "Progressive Video Summarization via Multimodal Self-Supervised Learning." Winter Conference on Applications of Computer Vision, 2023.](https://mlanthology.org/wacv/2023/li2023wacv-progressive/)

BibTeX

@inproceedings{li2023wacv-progressive,
  title     = {{Progressive Video Summarization via Multimodal Self-Supervised Learning}},
  author    = {Li, Haopeng and Ke, Qiuhong and Gong, Mingming and Drummond, Tom},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2023},
  pages     = {5584--5593},
  url       = {https://mlanthology.org/wacv/2023/li2023wacv-progressive/}
}