Static and Dynamic Concepts for Self-Supervised Video Representation Learning

Abstract

We propose a novel learning scheme for self-supervised video representation learning. Motivated by how humans understand videos, we first learn general visual concepts and then attend to discriminative local areas. Specifically, we use static frames and frame differences to decouple static and dynamic concepts, and align the respective concept distributions in latent space. We add diversity and fidelity regularizations to ensure that we learn a compact set of meaningful concepts. We then employ a cross-attention mechanism to aggregate detailed local features for the different concepts, and filter out redundant concepts with low activations before performing local concept contrast. Extensive experiments demonstrate that our method distills meaningful static and dynamic concepts to guide video understanding, and obtains state-of-the-art results on UCF-101, HMDB-51, and Diving-48.
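
The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the tensor shapes, the scaled dot-product form of the attention, and the use of peak attention weight as a concept's "activation" are all assumptions made for illustration, and the diversity and fidelity regularizers and the contrastive loss are omitted.

```python
import numpy as np


def frame_differences(clip):
    """Dynamic cue: differences between consecutive frames.

    clip: float array of shape (T, H, W, C). Returns shape (T-1, H, W, C).
    """
    return clip[1:] - clip[:-1]


def cross_attention(concepts, features):
    """Aggregate local features for each concept (assumed scaled dot-product form).

    concepts: (K, D) concept vectors used as queries.
    features: (N, D) local features used as keys and values.
    Returns the aggregated features (K, D) and the attention weights (K, N).
    """
    d = concepts.shape[-1]
    logits = concepts @ features.T / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over locations
    return weights @ features, weights


def filter_concepts(aggregated, weights, keep):
    """Keep the `keep` concepts with the highest activation.

    Activation is taken here to be the peak attention weight of each concept,
    which is an assumption, not a definition from the paper.
    """
    activation = weights.max(axis=-1)
    idx = np.argsort(-activation)[:keep]
    return aggregated[idx], idx
```

In this sketch the static branch would consume individual frames of `clip` and the dynamic branch would consume `frame_differences(clip)`; the concepts that survive `filter_concepts` are the ones that would enter the local concept contrast.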

Cite

Text

Qian et al. "Static and Dynamic Concepts for Self-Supervised Video Representation Learning." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19809-0_9

Markdown

[Qian et al. "Static and Dynamic Concepts for Self-Supervised Video Representation Learning." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/qian2022eccv-static/) doi:10.1007/978-3-031-19809-0_9

BibTeX

@inproceedings{qian2022eccv-static,
  title     = {{Static and Dynamic Concepts for Self-Supervised Video Representation Learning}},
  author    = {Qian, Rui and Ding, Shuangrui and Liu, Xian and Lin, Dahua},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-19809-0_9},
  url       = {https://mlanthology.org/eccv/2022/qian2022eccv-static/}
}