MERLOT Reserve: Neural Script Knowledge Through Vision and Language and Sound

Abstract

As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time -- through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong multimodal representations. When finetuned, it sets a new state-of-the-art on Visual Commonsense Reasoning (VCR), TVQA, and Kinetics-600, outperforming prior work by 5%, 7%, and 1.5% respectively. Ablations show that these tasks benefit from audio pretraining -- even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why incorporating audio leads to better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing the ethical and societal implications of multimodal pretraining.
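
The masked-snippet objective described above lends itself to a compact illustration. Below is a minimal sketch, in JAX, of a contrastive loss of this flavor: the model's prediction at a MASKed position must pick out the encoding of the true text or audio snippet from in-batch distractors. All names, shapes, and the temperature value are illustrative assumptions, not the authors' implementation.

import jax
import jax.numpy as jnp

def contrastive_snippet_loss(mask_preds, snippet_embs, temperature=0.05):
    # mask_preds:   [B, D] model predictions at MASKed positions.
    # snippet_embs: [B, D] encodings of the true (held-out) snippets;
    #               the other rows of the batch serve as distractors.
    # L2-normalize so dot products become cosine similarities.
    p = mask_preds / jnp.linalg.norm(mask_preds, axis=-1, keepdims=True)
    s = snippet_embs / jnp.linalg.norm(snippet_embs, axis=-1, keepdims=True)
    logits = p @ s.T / temperature              # [B, B] similarity matrix
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    # The correct snippet for each prediction sits on the diagonal.
    return -jnp.mean(jnp.diagonal(log_probs))

In the paper, losses of this general form are applied with both text and audio snippets as targets; the sketch shows only the shared contrastive core.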

Cite

Text

Zellers et al. "MERLOT Reserve: Neural Script Knowledge Through Vision and Language and Sound." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01589

Markdown

[Zellers et al. "MERLOT Reserve: Neural Script Knowledge Through Vision and Language and Sound." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/zellers2022cvpr-merlot/) doi:10.1109/CVPR52688.2022.01589

BibTeX

@inproceedings{zellers2022cvpr-merlot,
  title     = {{MERLOT Reserve: Neural Script Knowledge Through Vision and Language and Sound}},
  author    = {Zellers, Rowan and Lu, Jiasen and Lu, Ximing and Yu, Youngjae and Zhao, Yanpeng and Salehi, Mohammadreza and Kusupati, Aditya and Hessel, Jack and Farhadi, Ali and Choi, Yejin},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {16375--16387},
  doi       = {10.1109/CVPR52688.2022.01589},
  url       = {https://mlanthology.org/cvpr/2022/zellers2022cvpr-merlot/}
}