Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity

Abstract

We present CrissCross, a self-supervised framework for learning audio-visual representations. Our framework introduces a novel notion: in addition to learning intra-modal and standard 'synchronous' cross-modal relations, CrissCross also learns 'asynchronous' cross-modal relations. We perform in-depth studies showing that, by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong, generalized representations that are useful for a variety of downstream tasks. To pretrain our proposed solution, we use three datasets of varying sizes: Kinetics-Sound, Kinetics400, and AudioSet. The learned representations are evaluated on a number of downstream tasks, namely action recognition, sound classification, and action retrieval. Our experiments show that CrissCross either outperforms or performs on par with the current state-of-the-art self-supervised methods on action recognition and action retrieval with UCF101 and HMDB51, as well as on sound classification with ESC50 and DCASE. Moreover, CrissCross outperforms fully-supervised pretraining when pretrained on Kinetics-Sound.
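
The three kinds of relations named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `Segment`, `sample_segments`, and `relation_pairs`, the two-timestamp sampling scheme, and the `min_gap` parameter are all assumptions made for illustration, and the abstract does not specify how segments are sampled or which loss is applied to each pair.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    modality: str    # "audio" or "video"
    start: float     # seconds from the beginning of the clip
    duration: float  # segment length in seconds


def sample_segments(clip_len, seg_len, min_gap, rng):
    """Cut an audio and a video segment at each of two timestamps of one clip.

    Hypothetical scheme: the second timestamp is at least `min_gap`
    seconds after the first, so the two timestamps never coincide.
    """
    max_start = clip_len - seg_len
    if min_gap <= 0 or max_start < min_gap:
        raise ValueError("clip too short for two segments separated by min_gap")
    t1 = rng.uniform(0.0, max_start - min_gap)
    t2 = rng.uniform(t1 + min_gap, max_start)
    return {
        "v1": Segment("video", t1, seg_len),
        "a1": Segment("audio", t1, seg_len),
        "v2": Segment("video", t2, seg_len),
        "a2": Segment("audio", t2, seg_len),
    }


def relation_pairs(segs):
    """Group the four segments into the three relation types."""
    return {
        # Same modality, different timestamps.
        "intra_modal": [(segs["v1"], segs["v2"]), (segs["a1"], segs["a2"])],
        # Different modalities, same timestamp.
        "sync_cross": [(segs["v1"], segs["a1"]), (segs["v2"], segs["a2"])],
        # Different modalities, different timestamps.
        "async_cross": [(segs["v1"], segs["a2"]), (segs["v2"], segs["a1"])],
    }


if __name__ == "__main__":
    rng = random.Random(0)
    pairs = relation_pairs(sample_segments(10.0, 2.0, 1.0, rng))
    for name, group in pairs.items():
        for left, right in group:
            print(name, left.modality, round(left.start, 2),
                  right.modality, round(right.start, 2))
```

In a training loop, each pair would be passed through the corresponding encoders and scored by a similarity objective; that part is deliberately left out here because the abstract does not describe it.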

Cite

Text

Sarkar and Etemad. "Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity." AAAI Conference on Artificial Intelligence, 2023. doi:10.1609/AAAI.V37I8.26162

Markdown

[Sarkar and Etemad. "Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity." AAAI Conference on Artificial Intelligence, 2023.](https://mlanthology.org/aaai/2023/sarkar2023aaai-self/) doi:10.1609/AAAI.V37I8.26162

BibTeX

@inproceedings{sarkar2023aaai-self,
  title     = {{Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity}},
  author    = {Sarkar, Pritam and Etemad, Ali},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2023},
  pages     = {9723--9732},
  doi       = {10.1609/AAAI.V37I8.26162},
  url       = {https://mlanthology.org/aaai/2023/sarkar2023aaai-self/}
}