Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

Abstract

There is a natural correlation between the visual and auditory elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs. Without further fine-tuning, the resulting audio features achieve performance superior or comparable to the state-of-the-art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization to improve the accuracy of video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a remarkable gain of +19.9% in action recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.
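The contrastive objective mentioned in the abstract can be illustrated with a minimal sketch: a margin-based loss that pulls embeddings of synchronized audio-video pairs together and pushes unsynchronized (negative) pairs apart. The function and parameter names below, and the margin value, are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def contrastive_sync_loss(video_emb, audio_emb, in_sync, margin=1.0):
    """Margin-based contrastive loss over audio-video embedding pairs.

    video_emb, audio_emb: arrays of shape (batch, dim).
    in_sync: array of 1s (temporally synchronized pair) and 0s
    (negative pair, e.g. time-shifted audio or a different video).
    All names and the margin value are illustrative, not the
    paper's exact formulation.
    """
    dist = np.linalg.norm(video_emb - audio_emb, axis=-1)
    # Synchronized pairs: penalize any distance between the two embeddings.
    pos = in_sync * dist ** 2
    # Unsynchronized pairs: penalize only if closer than the margin.
    neg = (1 - in_sync) * np.maximum(margin - dist, 0.0) ** 2
    return float(np.mean(pos + neg))
```

In the paper's setup, the choice of negatives (easy negatives from different videos versus hard negatives from time-shifted audio of the same video) is introduced gradually via curriculum learning.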

Cite

Text

Korbar et al. "Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization." Neural Information Processing Systems, 2018.

Markdown

[Korbar et al. "Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization." Neural Information Processing Systems, 2018.](https://mlanthology.org/neurips/2018/korbar2018neurips-cooperative/)

BibTeX

@inproceedings{korbar2018neurips-cooperative,
  title     = {{Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization}},
  author    = {Korbar, Bruno and Tran, Du and Torresani, Lorenzo},
  booktitle = {Neural Information Processing Systems},
  year      = {2018},
  pages     = {7763--7774},
  url       = {https://mlanthology.org/neurips/2018/korbar2018neurips-cooperative/}
}