Hierarchical Self-Supervised Representation Learning for Movie Understanding

Fanyi Xiao, Kaustav Kundu, Joseph Tighe, Davide Modolo

CVPR 2022 pp. 9727-9736

doi:10.1109/CVPR52688.2022.00950

Abstract

Most self-supervised video representation learning approaches focus on action recognition. In contrast, in this paper we focus on self-supervised video learning for movie understanding and propose a novel hierarchical self-supervised pretraining strategy that separately pretrains each level of our hierarchical movie understanding model. Specifically, we propose to pretrain the low-level video backbone using a contrastive learning objective, while pretraining the higher-level video contextualizer using an event mask prediction task, which enables the use of different data sources for pretraining different levels of the hierarchy. We first show that our self-supervised pretraining strategies are effective and lead to improved performance on all tasks and metrics on the VidSitu benchmark (e.g., improving semantic role prediction from 47% to 61% CIDEr). We further demonstrate the effectiveness of our contextualized event features on LVU tasks, both when used alone and when combined with instance features, showing their complementarity.
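
The two pretraining objectives named in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the function names (`info_nce_loss`, `mask_events`), the feature sizes, the temperature, the zero-valued mask token, and the choice of an InfoNCE-style loss for the masked-event step are all assumptions made for illustration, and the backbone and contextualizer networks themselves are omitted.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def info_nce_loss(queries, keys, targets=None, temperature=0.1):
    """InfoNCE-style loss: cross-entropy over cosine similarities.

    Row i of `queries` should match row targets[i] of `keys`;
    by default targets[i] = i (aligned pairs).
    """
    n = queries.shape[0]
    if targets is None:
        targets = np.arange(n)
    logits = l2_normalize(queries) @ l2_normalize(keys).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(n), targets]))


def mask_events(event_feats, mask_positions, mask_token):
    """Copy the event sequence and overwrite the chosen events with a mask token."""
    masked = event_feats.copy()
    masked[mask_positions] = mask_token
    return masked


rng = np.random.default_rng(0)

# Low level (assumed setup): embeddings of two augmented views of the same
# 8 clips; matching rows are positives, all other rows are negatives.
clip_view_a = rng.normal(size=(8, 16))
clip_view_b = clip_view_a + 0.01 * rng.normal(size=(8, 16))
backbone_loss = info_nce_loss(clip_view_a, clip_view_b)

# Higher level (assumed setup): hide two of five event features in a movie.
event_feats = rng.normal(size=(5, 16))
mask_positions = np.array([1, 3])
mask_token = np.zeros(16)  # hypothetical; a real model would likely learn this
masked_input = mask_events(event_feats, mask_positions, mask_token)

# A contextualizer would map `masked_input` to predictions at the masked
# positions. Here a perfect prediction stands in for its output, scored
# against all events of the movie as candidates.
perfect_prediction = event_feats[mask_positions]
mask_loss = info_nce_loss(perfect_prediction, event_feats, targets=mask_positions)

print(f"backbone loss: {backbone_loss:.4f}, masked-event loss: {mask_loss:.4f}")
```

Because the two losses operate on different inputs (clip pairs versus event sequences), each level could in principle be trained on a different dataset, which is the property the abstract highlights.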

Cite

Text

Xiao et al. "Hierarchical Self-Supervised Representation Learning for Movie Understanding." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.00950

Markdown

[Xiao et al. "Hierarchical Self-Supervised Representation Learning for Movie Understanding." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/xiao2022cvpr-hierarchical/) doi:10.1109/CVPR52688.2022.00950

BibTeX

@inproceedings{xiao2022cvpr-hierarchical,
  title     = {{Hierarchical Self-Supervised Representation Learning for Movie Understanding}},
  author    = {Xiao, Fanyi and Kundu, Kaustav and Tighe, Joseph and Modolo, Davide},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {9727--9736},
  doi       = {10.1109/CVPR52688.2022.00950},
  url       = {https://mlanthology.org/cvpr/2022/xiao2022cvpr-hierarchical/}
}