Self-Supervised MultiModal Versatile Networks

Abstract

Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to visual data in the form of a video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51, Kinetics600, AudioSet and ESC-50 when compared to previous self-supervised work. Our models are publicly available.
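The abstract's central design choice — keeping a fine-grained space for vision and audio while integrating text into a coarser common embedding — can be illustrated with a minimal NumPy sketch. This is not the authors' released model: all dimensions, the random projection weights, and the InfoNCE-style contrastive loss below are illustrative assumptions; it only shows the shape of the idea (embed video and audio into a fine space, then project into a shared space that also hosts text, and align matching pairs contrastively).

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # unit-normalize embeddings so dot products are cosine similarities
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def project(x, w):
    # linear projection into an embedding space, followed by normalization
    return l2_normalize(x @ w)

def info_nce(a, b, temperature=0.07):
    # contrastive loss: matching rows of a and b are positives,
    # all other rows in the batch are negatives
    logits = (a @ b.T) / temperature
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(a))
    return -log_probs[idx, idx].mean()

rng = np.random.default_rng(0)

# hypothetical feature and embedding dimensions (assumptions, not the paper's)
d_v, d_a, d_t = 512, 128, 300        # backbone feature sizes per modality
d_fine, d_shared = 256, 64           # fine vision-audio space, coarser shared space

# hypothetical projection weights; in the real model these are learned
w_v = rng.standard_normal((d_v, d_fine))
w_a = rng.standard_normal((d_a, d_fine))
w_fs = rng.standard_normal((d_fine, d_shared))  # fine -> shared
w_t = rng.standard_normal((d_t, d_shared))      # text goes straight to shared

# a toy batch of 4 co-occurring (video, audio, text) features
video_feat = rng.standard_normal((4, d_v))
audio_feat = rng.standard_normal((4, d_a))
text_feat = rng.standard_normal((4, d_t))

# fine-grained vision-audio space: preserves detail for video/audio tasks
z_v_fine = project(video_feat, w_v)
z_a_fine = project(audio_feat, w_a)

# coarser shared space: vision is further projected to meet text
z_v_shared = project(z_v_fine, w_fs)
z_t_shared = project(text_feat, w_t)

# total loss: align video-audio in the fine space, video-text in the shared one
loss = info_nce(z_v_fine, z_a_fine) + info_nce(z_v_shared, z_t_shared)
```

The point of the asymmetry is that audio and vision are aligned without being forced through the lower-capacity text space, while text still shares an embedding with vision via the extra projection.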

Cite

Text

Alayrac et al. "Self-Supervised MultiModal Versatile Networks." Neural Information Processing Systems, 2020.

Markdown

[Alayrac et al. "Self-Supervised MultiModal Versatile Networks." Neural Information Processing Systems, 2020.](https://mlanthology.org/neurips/2020/alayrac2020neurips-selfsupervised/)

BibTeX

@inproceedings{alayrac2020neurips-selfsupervised,
  title     = {{Self-Supervised MultiModal Versatile Networks}},
  author    = {Alayrac, Jean-Baptiste and Recasens, Adria and Schneider, Rosalia and Arandjelović, Relja and Ramapuram, Jason and De Fauw, Jeffrey and Smaira, Lucas and Dieleman, Sander and Zisserman, Andrew},
  booktitle = {Neural Information Processing Systems},
  year      = {2020},
  url       = {https://mlanthology.org/neurips/2020/alayrac2020neurips-selfsupervised/}
}