VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models

Li, Shicheng; Li, Lei; Liu, Yi; Ren, Shuhuai; Liu, Yuanxin; Gao, Rundong; Sun, Xu; Hou, Lu

doi:10.1007/978-3-031-72897-6_19

ECCV 2024

Abstract

The ability to perceive how objects change over time is a crucial ingredient in human intelligence. However, current benchmarks cannot faithfully reflect the temporal understanding abilities of video-language models (VidLMs) due to the existence of static visual shortcuts. To remedy this issue, we present VITATECS, a diagnostic VIdeo-Text dAtaset for the evaluation of TEmporal Concept underStanding. Specifically, we first introduce a fine-grained taxonomy of temporal concepts in natural language in order to diagnose the capability of VidLMs to comprehend different temporal aspects. Furthermore, to disentangle the correlation between static and temporal information, we generate counterfactual video descriptions that differ from the original one only in the specified temporal aspect. We employ a semi-automatic data collection framework using large language models and human-in-the-loop annotation to obtain high-quality counterfactual descriptions efficiently. Evaluation of representative video-language understanding models confirms their deficiency in temporal understanding, revealing the need for greater emphasis on the temporal elements in video-language research. Our dataset is publicly available at https://github.com/lscpku/VITATECS.
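The abstract pairs each video with an original description and a counterfactual that differs only in one temporal aspect. The following is a minimal sketch of how such pairs could be scored, not the authors' evaluation code. It assumes a model counts as correct when it rates the original description above the counterfactual. The `Sample` fields, the aspect labels, the `pairwise_accuracy` helper, and the toy scorer are all hypothetical names introduced for illustration; the actual data format is defined in the dataset repository.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    video_id: str
    caption: str         # original description of the video
    counterfactual: str  # differs from the caption in one temporal aspect
    aspect: str          # hypothetical aspect label for this pair


def pairwise_accuracy(
    samples: List[Sample],
    score: Callable[[str, str], float],
) -> Dict[str, float]:
    """Per-aspect fraction of pairs where the original caption outscores
    the counterfactual (assumed metric, ties count as failures)."""
    totals: Dict[str, int] = {}
    hits: Dict[str, int] = {}
    for s in samples:
        totals[s.aspect] = totals.get(s.aspect, 0) + 1
        if score(s.video_id, s.caption) > score(s.video_id, s.counterfactual):
            hits[s.aspect] = hits.get(s.aspect, 0) + 1
    return {aspect: hits.get(aspect, 0) / n for aspect, n in totals.items()}


def toy_score(video_id: str, text: str) -> float:
    """Stand-in for a video-text similarity model: it only checks for one
    keyword, so it cannot tell apart captions that reorder the same words."""
    return 1.0 if "up" in text.split() else 0.0


samples = [
    Sample("v1", "a man walks up the stairs",
           "a man walks down the stairs", "direction"),
    Sample("v2", "she opens the box then closes it",
           "she closes the box then opens it", "sequence"),
]
result = pairwise_accuracy(samples, toy_score)
print(result)
```

In practice `toy_score` would be replaced by a real video-text similarity function. Counting ties as failures is a deliberate choice in this sketch: a scorer that relies only on static cues tends to assign both captions the same score, and that should not be credited as temporal understanding.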

Cite

Text

Li et al. "VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72897-6_19

Markdown

[Li et al. "VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/li2024eccv-vitatecs/) doi:10.1007/978-3-031-72897-6_19

BibTeX

@inproceedings{li2024eccv-vitatecs,
  title     = {{VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models}},
  author    = {Li, Shicheng and Li, Lei and Liu, Yi and Ren, Shuhuai and Liu, Yuanxin and Gao, Rundong and Sun, Xu and Hou, Lu},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72897-6_19},
  url       = {https://mlanthology.org/eccv/2024/li2024eccv-vitatecs/}
}