VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding

Abstract

Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query. This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task. Unlike prevalent closed-set approaches that struggle with open-vocabulary scenarios due to limited training data and predefined vocabularies, our model leverages pre-trained representations from foundational spatial grounding models. This empowers it to effectively bridge the semantic gap between natural language and diverse visual content, achieving strong performance in closed-set and open-vocabulary settings. Our contributions include a novel spatio-temporal video grounding model, surpassing state-of-the-art results in closed-set evaluations on multiple datasets and demonstrating superior performance in open-vocabulary scenarios. Notably, the proposed model outperforms state-of-the-art methods in closed-set settings on the VidSTG (Declarative and Interrogative) and HC-STVG (V1 and V2) datasets. Furthermore, in open-vocabulary evaluations on HC-STVG V1 and YouCook-Interactions, our model surpasses the recent best-performing models by 4.88 m_vIoU and 1.83 accuracy, demonstrating its efficacy in handling diverse linguistic and visual concepts for improved video understanding. Our code will be publicly released.

Cite

Text

Wasim et al. "VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding." Conference on Computer Vision and Pattern Recognition, 2024.

Markdown

[Wasim et al. "VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/wasim2024cvpr-videogroundingdino/)

BibTeX

@inproceedings{wasim2024cvpr-videogroundingdino,
  title     = {{VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding}},
  author    = {Wasim, Syed Talal and Naseer, Muzammal and Khan, Salman and Yang, Ming-Hsuan and Khan, Fahad Shahbaz},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {18909--18918},
  url       = {https://mlanthology.org/cvpr/2024/wasim2024cvpr-videogroundingdino/}
}