Long-Range Multimodal Pretraining for Movie Understanding

Abstract

Learning computer vision models from (and for) movies has a long-standing history. While great progress has been made, the community still lacks a pretrained multimodal model that performs well across the ever-growing set of movie understanding tasks it has been establishing. In this work, we introduce Long-range Multimodal Pretraining, a strategy and model that leverages movie data to train transferable multimodal and cross-modal encoders. Our key idea is to learn from all modalities in a movie by observing and extracting relationships over a long range of time. After pretraining, we run ablation studies on the LVU benchmark that validate our modeling choices and the importance of learning from long-range time spans. Our model achieves state-of-the-art performance on several LVU tasks while being much more data-efficient than previous works. Finally, we assess our model's transferability, setting a new state of the art on five different benchmarks.
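
To make the key idea concrete, below is a minimal PyTorch sketch of a long-range multimodal encoder trained with a cross-modal contrastive objective. This is an illustration only: the module names, feature dimensions, and the choice of loss are assumptions for the sketch, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LongRangeMultimodalEncoder(nn.Module):
    """Hypothetical sketch: a shared transformer contextualizes per-shot
    features of each modality over a long sequence of shots."""

    def __init__(self, feat_dims, d_model=256, n_layers=4, n_heads=8, max_shots=512):
        super().__init__()
        # One linear projection per modality into a shared embedding space.
        self.proj = nn.ModuleDict({m: nn.Linear(d, d_model) for m, d in feat_dims.items()})
        # Learned positional embeddings over the (long) shot index.
        self.pos = nn.Embedding(max_shots, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, feats):
        # feats: {modality: tensor of shape (batch, num_shots, feat_dim)}
        out = {}
        for m, x in feats.items():
            idx = torch.arange(x.size(1), device=x.device)
            h = self.proj[m](x) + self.pos(idx)
            out[m] = self.encoder(h)  # long-range context within each modality
        return out


def cross_modal_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over time-aligned shot embeddings of two modalities."""
    a = F.normalize(a.flatten(0, 1), dim=-1)
    b = F.normalize(b.flatten(0, 1), dim=-1)
    logits = a @ b.t() / temperature
    target = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target))


if __name__ == "__main__":
    # Toy example: 2 movies, 64 shots each, with precomputed per-shot features.
    model = LongRangeMultimodalEncoder({"video": 512, "audio": 128, "text": 768})
    feats = {"video": torch.randn(2, 64, 512),
             "audio": torch.randn(2, 64, 128),
             "text": torch.randn(2, 64, 768)}
    enc = model(feats)
    loss = cross_modal_nce(enc["video"], enc["text"])
    loss.backward()

In this toy setup, each modality's per-shot features are contextualized by a shared transformer over a long window of shots, and time-aligned video and dialogue embeddings are pulled together by an InfoNCE-style loss; the actual pretraining objectives and encoder design are described in the paper.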

Cite

Text

Argaw et al. "Long-Range Multimodal Pretraining for Movie Understanding." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01232

Markdown

[Argaw et al. "Long-Range Multimodal Pretraining for Movie Understanding." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/argaw2023iccv-longrange/) doi:10.1109/ICCV51070.2023.01232

BibTeX

@inproceedings{argaw2023iccv-longrange,
  title     = {{Long-Range Multimodal Pretraining for Movie Understanding}},
  author    = {Argaw, Dawit Mureja and Lee, Joon-Young and Woodson, Markus and Kweon, In So and Heilbron, Fabian Caba},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {13392--13403},
  doi       = {10.1109/ICCV51070.2023.01232},
  url       = {https://mlanthology.org/iccv/2023/argaw2023iccv-longrange/}
}