Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding

Abstract

Existing video-language pre-training methods focus primarily on instance-level alignment between video clips and captions via global contrastive learning, but they neglect the rich fine-grained local information in both videos and text, which is important for downstream tasks that require temporal localization and semantic reasoning. A powerful model should capture region-object correspondences and recognize scene changes within a video clip, reflecting spatial and temporal granularity, respectively. To strengthen the model's understanding of such fine-grained details, we propose a simple yet effective video-language modeling framework, S-ViLM, which exploits the intrinsic structures of these two modalities. It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, that simultaneously promote learning region-object alignment and temporally aware features. Comprehensive evaluations demonstrate that S-ViLM performs favorably against existing approaches in learning more expressive representations. Specifically, S-ViLM substantially surpasses state-of-the-art methods on four representative downstream tasks: text-video retrieval, video question answering, video action recognition, and temporal action localization.
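For readers unfamiliar with the instance-level alignment the abstract contrasts with, the sketch below shows a symmetric InfoNCE objective over batched clip/caption embeddings, the standard form of global contrastive learning in video-language pre-training. It is a minimal generic illustration; the function name and temperature value are assumptions for exposition, not the paper's actual implementation.

import torch
import torch.nn.functional as F

def global_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired clip/caption embeddings.

    video_emb, text_emb: (B, D) tensors; pairs at the same batch index match.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature  # (B, B) cosine-similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Matched pairs sit on the diagonal; all off-diagonal entries act as negatives.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)

Because this objective aligns only whole clips with whole captions, it provides no training signal for region-object correspondence or scene boundaries, which is the gap the paper's spatial grounding and temporal grouping designs target.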

Cite

Text

Xiong et al. "Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding." International Conference on Learning Representations, 2024.

Markdown

[Xiong et al. "Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/xiong2024iclr-structured/)

BibTeX

@inproceedings{xiong2024iclr-structured,
  title     = {{Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding}},
  author    = {Xiong, Yuanhao and Zhao, Long and Gong, Boqing and Yang, Ming-Hsuan and Schroff, Florian and Liu, Ting and Hsieh, Cho-Jui and Yuan, Liangzhe},
  booktitle = {International Conference on Learning Representations},
  year      = {2024},
  url       = {https://mlanthology.org/iclr/2024/xiong2024iclr-structured/}
}