VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding

Wasim, Syed Talal; Naseer, Muzammal; Khan, Salman; Yang, Ming-Hsuan; Khan, Fahad Shahbaz

CVPR 2024 pp. 18909-18918

Abstract

Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query. This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task. Unlike prevalent closed-set approaches, which struggle with open-vocabulary scenarios due to limited training data and predefined vocabularies, our model leverages pre-trained representations from foundational spatial grounding models. This lets it bridge the semantic gap between natural language and diverse visual content, achieving strong performance in both closed-set and open-vocabulary settings. Our contributions include a novel spatio-temporal video grounding model that surpasses state-of-the-art results in closed-set evaluations on multiple datasets and demonstrates superior performance in open-vocabulary scenarios. Notably, the proposed model outperforms state-of-the-art methods in closed-set settings on the VidSTG (Declarative and Interrogative) and HC-STVG (V1 and V2) datasets. Furthermore, in open-vocabulary evaluations on HC-STVG V1 and YouCook-Interactions, our model surpasses the recent best-performing models by 4.88 m_vIoU and 1.83 accuracy, demonstrating its efficacy in handling diverse linguistic and visual concepts for improved video understanding. Our code will be publicly released.

Cite

Text

Wasim et al. "VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding." Conference on Computer Vision and Pattern Recognition, 2024.

Markdown

[Wasim et al. "VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/wasim2024cvpr-videogroundingdino/)

BibTeX

@inproceedings{wasim2024cvpr-videogroundingdino,
  title     = {{VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding}},
  author    = {Wasim, Syed Talal and Naseer, Muzammal and Khan, Salman and Yang, Ming-Hsuan and Khan, Fahad Shahbaz},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {18909--18918},
  url       = {https://mlanthology.org/cvpr/2024/wasim2024cvpr-videogroundingdino/}
}