Audiovisual Masked Autoencoders

Abstract

Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.
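
The core idea described in the abstract — masked autoencoding over joint audiovisual tokens — can be sketched roughly as follows, assuming PyTorch. The token dimensions, layer counts, masking ratio, and shared-encoder fusion below are illustrative assumptions, not the paper's exact architecture; the paper itself studies several such design choices (e.g., shared vs. modality-specific encoders).

import torch
import torch.nn as nn
import torch.nn.functional as F


class AudiovisualMAE(nn.Module):
    """Minimal audiovisual masked-autoencoder sketch (illustrative, not the paper's exact model)."""

    def __init__(self, num_tokens=196, dim=256, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Separate patch embeddings per modality; input patch sizes are assumed.
        self.embed_video = nn.Linear(768, dim)  # flattened video patches
        self.embed_audio = nn.Linear(256, dim)  # flattened spectrogram patches
        self.pos = nn.Parameter(torch.zeros(1, 2 * num_tokens, dim))
        # A shared encoder sees the visible tokens of both modalities jointly.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
        # A lightweight decoder reconstructs the full sequence from mask tokens + context.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.head_video = nn.Linear(dim, 768)
        self.head_audio = nn.Linear(dim, 256)

    def forward(self, video_patches, audio_patches):
        # Tokenize both modalities and concatenate into one sequence.
        x = torch.cat([self.embed_video(video_patches),
                       self.embed_audio(audio_patches)], dim=1) + self.pos
        B, N, D = x.shape
        # Randomly keep a fraction of tokens (per-sample shuffle, as in MAE).
        keep = int(N * (1 - self.mask_ratio))
        ids = torch.rand(B, N, device=x.device).argsort(dim=1)
        ids_restore = ids.argsort(dim=1)
        visible = torch.gather(x, 1, ids[:, :keep, None].expand(-1, -1, D))
        # Encode only the visible tokens: the main compute saving of masked autoencoding.
        latent = self.encoder(visible)
        # Append mask tokens, unshuffle back to the original order, and decode.
        full = torch.cat([latent, self.mask_token.expand(B, N - keep, D)], dim=1)
        full = torch.gather(full, 1, ids_restore[:, :, None].expand(-1, -1, D))
        dec = self.decoder(full)
        n_v = video_patches.shape[1]
        # Reconstruction loss over both modalities (sketch: computed on all tokens).
        return (F.mse_loss(self.head_video(dec[:, :n_v]), video_patches)
                + F.mse_loss(self.head_audio(dec[:, n_v:]), audio_patches))


model = AudiovisualMAE()
loss = model(torch.randn(2, 196, 768), torch.randn(2, 196, 256))

After pretraining, the decoder is discarded and the encoder is fine-tuned on downstream audiovisual or unimodal classification tasks, which is what enables a single pretrained model to serve multiple modalities.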

Cite

Text

Georgescu et al. "Audiovisual Masked Autoencoders." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01479

Markdown

[Georgescu et al. "Audiovisual Masked Autoencoders." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/georgescu2023iccv-audiovisual/) doi:10.1109/ICCV51070.2023.01479

BibTeX

@inproceedings{georgescu2023iccv-audiovisual,
  title     = {{Audiovisual Masked Autoencoders}},
  author    = {Georgescu, Mariana-Iuliana and Fonseca, Eduardo and Ionescu, Radu Tudor and Lucic, Mario and Schmid, Cordelia and Arnab, Anurag},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {16144--16154},
  doi       = {10.1109/ICCV51070.2023.01479},
  url       = {https://mlanthology.org/iccv/2023/georgescu2023iccv-audiovisual/}
}