Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data
Abstract
Recent video-text foundation models have demonstrated strong performance on a wide variety of downstream video understanding tasks. Can these video-text models genuinely understand the contents of natural videos? Standard video-text evaluations can be misleading, as many questions can be answered merely from the objects and contexts in a single frame or from biases inherent in the datasets. In this paper, we aim to better assess the capabilities of current video-text models and understand their limitations. We propose a novel evaluation task for video-text understanding, namely retrieval from counterfactually augmented data (RCAD), and a new dataset. To succeed on our new evaluation task, models must derive a comprehensive understanding of the video from cross-frame reasoning. Analyses show that previous video-text foundation models can be easily fooled by counterfactually augmented data and are far behind human-level performance. In order to narrow the gap between video-text models and human performance on RCAD, we identify a key limitation of current contrastive approaches on video-text data and introduce LLM-teacher, a more effective approach to learn action semantics by leveraging knowledge obtained from a pretrained large language model. Experiments and analyses show that our approach successfully learns more discriminative action embeddings and improves results on RCAD when applied to multiple video-text models. Our dataset and project page are available here.
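To make the retrieval setup described in the abstract concrete, the sketch below scores a video embedding against its original caption and several counterfactually edited captions, counting a success only when the original caption outranks every distractor. This is an illustrative reconstruction, not the authors' released code: the function `rcad_accuracy` and the embedding shapes are assumptions.

```python
import numpy as np

def rcad_accuracy(video_embs: np.ndarray, caption_embs: np.ndarray) -> float:
    """Hypothetical RCAD-style metric.

    video_embs:   (N, D) embeddings, one per video.
    caption_embs: (N, K, D) embeddings; index 0 along K is the original
                  caption, indices 1..K-1 are counterfactual edits.
    """
    # L2-normalize so dot products become cosine similarities.
    v = video_embs / np.linalg.norm(video_embs, axis=-1, keepdims=True)
    c = caption_embs / np.linalg.norm(caption_embs, axis=-1, keepdims=True)
    # Similarity of each video to its K candidate captions: shape (N, K).
    sims = np.einsum("nd,nkd->nk", v, c)
    # A model succeeds on a video when the original caption (index 0)
    # scores higher than all counterfactual distractors.
    return float(np.mean(sims.argmax(axis=1) == 0))

# Toy usage with random embeddings (2 videos, 4 candidate captions each).
rng = np.random.default_rng(0)
acc = rcad_accuracy(rng.normal(size=(2, 512)), rng.normal(size=(2, 4, 512)))
print(f"RCAD-style accuracy: {acc:.2f}")
```

Because the distractors are minimal counterfactual edits of the true caption (e.g., a changed action) rather than captions of unrelated videos, single-frame cues are not enough to separate them, which is what makes the task a probe of cross-frame reasoning.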
Cite
Text
Ma et al. "Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72624-8_15
Markdown
[Ma et al. "Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/ma2024eccv-rethinking/) doi:10.1007/978-3-031-72624-8_15
BibTeX
@inproceedings{ma2024eccv-rethinking,
  title = {{Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data}},
  author = {Ma, Wufei and Li, Kai and Jiang, Zhongshi and Meshry, Moustafa and Liu, Qihao and Wang, Huiyu and Haene, Christian and Yuille, Alan},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year = {2024},
  doi = {10.1007/978-3-031-72624-8_15},
  url = {https://mlanthology.org/eccv/2024/ma2024eccv-rethinking/}
}