Rethinking Weakly-Supervised Video Temporal Grounding from a Game Perspective

Abstract

This paper addresses the challenging task of weakly-supervised video temporal grounding. Existing approaches are generally based on the moment proposal selection framework, which uses contrastive learning and reconstruction paradigms to score pre-defined moment proposals. Although they have achieved significant progress, we argue that their current frameworks overlook two critical issues: 1) Coarse-grained cross-modal learning: previous methods only capture global video-level alignment with the query, failing to model the fine-grained consistency between video frames and query words needed to accurately ground moment boundaries. 2) Complex moment proposals: their performance relies heavily on the quality of the proposals, whose selection is also time-consuming and complicated. To this end, in this paper, we make the first attempt to tackle this task from a novel game perspective, which effectively learns the uncertain relationships between vision-language pairs at diverse granularities and in flexible combinations for multi-level cross-modal interaction. Specifically, we creatively model each video frame and query word as a game player under multivariate cooperative game theory to learn its contribution to the cross-modal similarity score. By quantifying the tendency of frames and words to cooperate within a coalition via the game-theoretic interaction, we are able to value all uncertain but possible correspondences between frames and words. Finally, instead of using moment proposals, we utilize the learned query-guided frame-wise scores for better moment localization. Experiments show that our method achieves superior performance on both the Charades-STA and ActivityNet Captions datasets.
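The game-theoretic interaction mentioned above can be illustrated with a minimal sketch. The standard pairwise interaction index from cooperative game theory scores two players i and j by how much their joint presence changes a coalition's value beyond their individual contributions: I(i, j) = E_S[v(S∪{i,j}) − v(S∪{i}) − v(S∪{j}) + v(S)]. The toy embeddings, the cosine-similarity coalition value, and the Monte-Carlo sampling scheme below are illustrative assumptions, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy embeddings: 8 video frames and 5 query words, 16-dim each
# (in the paper these would come from learned video/text encoders).
frames = rng.normal(size=(8, 16))
words = rng.normal(size=(5, 16))
players = np.concatenate([frames, words])  # each frame and word is a "player"
n = len(players)

def coalition_value(mask):
    """Assumed value v(S): cosine similarity between the mean frame embedding
    and the mean word embedding inside coalition S (0 if S lacks a modality)."""
    idx = np.flatnonzero(mask)
    f = idx[idx < len(frames)]
    w = idx[idx >= len(frames)]
    if len(f) == 0 or len(w) == 0:
        return 0.0
    a, b = players[f].mean(axis=0), players[w].mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def interaction(i, j, num_samples=64):
    """Monte-Carlo estimate of the pairwise game-theoretic interaction
    I(i, j) = E_S[ v(S∪{i,j}) - v(S∪{i}) - v(S∪{j}) + v(S) ],
    averaging over random coalitions S that exclude both players."""
    total = 0.0
    for _ in range(num_samples):
        mask = rng.random(n) < 0.5
        mask[[i, j]] = False          # coalition S excludes players i and j
        with_i = mask.copy(); with_i[i] = True
        with_j = mask.copy(); with_j[j] = True
        with_ij = with_i.copy(); with_ij[j] = True
        total += (coalition_value(with_ij) - coalition_value(with_i)
                  - coalition_value(with_j) + coalition_value(mask))
    return total / num_samples

# Interaction between frame 0 and word 0 (words are offset by the frame count).
score = interaction(0, len(frames) + 0)
print(round(score, 4))
```

A positive score suggests the frame and word tend to cooperate (their joint presence raises cross-modal similarity more than their separate contributions), which is the kind of signal the paper aggregates into query-guided frame-wise scores for proposal-free localization.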

Cite

Text

Fang et al. "Rethinking Weakly-Supervised Video Temporal Grounding from a Game Perspective." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72995-9_17

Markdown

[Fang et al. "Rethinking Weakly-Supervised Video Temporal Grounding from a Game Perspective." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/fang2024eccv-rethinking/) doi:10.1007/978-3-031-72995-9_17

BibTeX

@inproceedings{fang2024eccv-rethinking,
  title     = {{Rethinking Weakly-Supervised Video Temporal Grounding from a Game Perspective}},
  author    = {Fang, Xiang and Xiong, Zeyu and Fang, Wanlong and Qu, Xiaoye and Chen, Chen and Dong, Jianfeng and Tang, Keke and Zhou, Pan and Cheng, Yu and Liu, Daizong},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72995-9_17},
  url       = {https://mlanthology.org/eccv/2024/fang2024eccv-rethinking/}
}