Verbs in Action: Improving Verb Understanding in Video-Language Models
Abstract
Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time. Recently, state-of-the-art video-language models based on CLIP have been shown to have limited verb understanding and to rely extensively on nouns, restricting their performance in real-world video applications that require action and temporal understanding. In this work, we improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive (VFC) framework. This consists of two main components: (1) leveraging pretrained large language models (LLMs) to create hard negatives for cross-modal contrastive learning, together with a calibration strategy to balance the occurrence of concepts in positive and negative pairs; and (2) enforcing a fine-grained, verb phrase alignment loss. Our method achieves state-of-the-art zero-shot results on three downstream tasks that focus on verb understanding (video-text matching, video question-answering and video classification) while maintaining performance in noun-focused settings. To the best of our knowledge, this is the first work that proposes a method to alleviate the verb understanding problem rather than simply highlighting it. Our code is publicly available.
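As a rough illustration of component (1), the sketch below shows a generic video-to-text contrastive (InfoNCE-style) loss in which each video is contrasted against the in-batch captions plus a few extra hard-negative captions. It is not the authors' implementation: the function names, the tensor shapes, the temperature value and the use of NumPy are all assumptions made for illustration, and the calibration strategy and verb phrase alignment loss are not modelled. In the setting the abstract describes, the hard negatives would be captions generated by an LLM; here they are simply given as embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def contrastive_loss_with_hard_negatives(video, text, hard_neg_text, temperature=0.07):
    """Video-to-text InfoNCE loss with extra per-video hard negatives.

    video:         (B, D) video embeddings
    text:          (B, D) matching caption embeddings (positives)
    hard_neg_text: (B, K, D) embeddings of K hard-negative captions per video
                   (hypothetical input; e.g. captions with the verb changed)
    """
    v = l2_normalize(video)
    t = l2_normalize(text)
    h = l2_normalize(hard_neg_text)

    # In-batch cosine similarities, shape (B, B); positives lie on the diagonal.
    sim_batch = v @ t.T
    # Similarity of each video to its own hard negatives, shape (B, K).
    sim_hard = np.einsum("bd,bkd->bk", v, h)

    # Hard negatives only enlarge the softmax denominator.
    logits = np.concatenate([sim_batch, sim_hard], axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    idx = np.arange(v.shape[0])
    return float(-log_prob[idx, idx].mean())
```

Because the hard negatives appear only in the denominator, adding them can never lower the loss for a fixed set of embeddings; the model has to separate the true caption from its verb-altered variants to reduce it.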
Cite
Text
Momeni et al. "Verbs in Action: Improving Verb Understanding in Video-Language Models." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01428
Markdown
[Momeni et al. "Verbs in Action: Improving Verb Understanding in Video-Language Models." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/momeni2023iccv-verbs/) doi:10.1109/ICCV51070.2023.01428
BibTeX
@inproceedings{momeni2023iccv-verbs,
title = {{Verbs in Action: Improving Verb Understanding in Video-Language Models}},
author = {Momeni, Liliane and Caron, Mathilde and Nagrani, Arsha and Zisserman, Andrew and Schmid, Cordelia},
booktitle = {International Conference on Computer Vision},
year = {2023},
pages = {15579-15591},
doi = {10.1109/ICCV51070.2023.01428},
url = {https://mlanthology.org/iccv/2023/momeni2023iccv-verbs/}
}