Verbs in Action: Improving Verb Understanding in Video-Language Models

Abstract

Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time. Recently, state-of-the-art video-language models based on CLIP have been shown to have limited verb understanding and to rely extensively on nouns, restricting their performance in real-world video applications that require action and temporal understanding. In this work, we improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive (VFC) framework. This consists of two main components: (1) leveraging pretrained large language models (LLMs) to create hard negatives for cross-modal contrastive learning, together with a calibration strategy to balance the occurrence of concepts in positive and negative pairs; and (2) enforcing a fine-grained, verb phrase alignment loss. Our method achieves state-of-the-art zero-shot results on three downstream tasks that focus on verb understanding (video-text matching, video question-answering and video classification), while maintaining performance on noun-focused settings. To the best of our knowledge, this is the first work that proposes a method to alleviate the verb understanding problem rather than simply highlighting it. Our code is publicly available.
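To make component (1) concrete, the following is a minimal sketch of a video-to-text contrastive loss in which each video gets one extra hard-negative caption. It is an illustration only, not the authors' implementation: the function name `video_to_text_loss`, the temperature of 0.07, the use of cosine similarity, and the choice of exactly one hard negative per video are all assumptions, and the calibration strategy and the verb phrase alignment loss from the abstract are not modelled.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def video_to_text_loss(video_embs, text_embs, hard_neg_embs, temperature=0.07):
    """InfoNCE-style loss with one extra hard-negative caption per video.

    video_embs[i] and text_embs[i] form a positive pair. hard_neg_embs[i] is
    the embedding of a caption that is wrong for video i (for example, the
    original caption with its verb changed). All names and the temperature
    value are hypothetical choices made for this sketch.
    """
    videos = [normalize(v) for v in video_embs]
    texts = [normalize(t) for t in text_embs]
    hards = [normalize(h) for h in hard_neg_embs]

    total = 0.0
    for i, v in enumerate(videos):
        # Similarities to every caption in the batch (in-batch negatives) ...
        logits = [dot(v, t) / temperature for t in texts]
        # ... plus the similarity to this video's own hard negative.
        logits.append(dot(v, hards[i]) / temperature)

        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))

        # Cross-entropy with the matching caption (index i) as the target.
        total += log_denominator - logits[i]
    return total / len(videos)


# Toy 2-D embeddings, chosen by hand purely for illustration.
videos = [[1.0, 0.0], [0.0, 1.0]]
captions = [[1.0, 0.0], [0.0, 1.0]]
easy_negatives = [[-1.0, 0.0], [0.0, -1.0]]  # far from the matching video
hard_negatives = [[0.9, 0.1], [0.1, 0.9]]    # close to the matching video

loss_easy = video_to_text_loss(videos, captions, easy_negatives)
loss_hard = video_to_text_loss(videos, captions, hard_negatives)
```

In this toy setup the loss is larger with the hard negatives than with the easy ones, because a negative caption that sits close to the video in embedding space takes probability mass away from the correct caption. That is the property that makes hard negatives a useful training signal.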

Cite

Text

Momeni et al. "Verbs in Action: Improving Verb Understanding in Video-Language Models." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01428

Markdown

[Momeni et al. "Verbs in Action: Improving Verb Understanding in Video-Language Models." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/momeni2023iccv-verbs/) doi:10.1109/ICCV51070.2023.01428

BibTeX

@inproceedings{momeni2023iccv-verbs,
  title     = {{Verbs in Action: Improving Verb Understanding in Video-Language Models}},
  author    = {Momeni, Liliane and Caron, Mathilde and Nagrani, Arsha and Zisserman, Andrew and Schmid, Cordelia},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {15579--15591},
  doi       = {10.1109/ICCV51070.2023.01428},
  url       = {https://mlanthology.org/iccv/2023/momeni2023iccv-verbs/}
}