STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding
Abstract
In this work, we study Weakly Supervised Spatio-Temporal Video Grounding (WSTVG), a challenging task of localizing subjects spatio-temporally in videos using only textual queries and no bounding box supervision. Inspired by recent advances in vision-language foundation models, we investigate their utility for WSTVG, leveraging their zero-shot grounding capabilities. However, we find that a simple adaptation lacks essential spatio-temporal grounding abilities. To bridge this gap, we introduce Tubelet Referral Grounding (TRG), which connects textual queries to tubelets to enable spatio-temporal predictions. Despite its promise, TRG struggles with compositional action understanding and dense scene scenarios. To address these limitations, we propose STPro, a progressive learning framework with two key modules: Sub-Action Temporal Curriculum Learning (SA-TCL), which incrementally builds compositional action understanding, and Congestion-Guided Spatial Curriculum Learning (CG-SCL), which adapts the model to complex scenes by progressively increasing spatial task difficulty. STPro achieves state-of-the-art results on three benchmark datasets, with improvements of 1.0% on VidSTG-Declarative and 3.0% on HCSTVG-v1.
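To make the progressive-learning idea in the abstract concrete, below is a minimal, illustrative sketch of an easy-to-hard curriculum schedule. It is not the authors' implementation: the sample fields (`sub_actions`, `num_tubelets`), the difficulty scores, and the stage fractions are all hypothetical placeholders standing in for SA-TCL's sub-action-based ordering and CG-SCL's congestion-based ordering.

```python
# Hypothetical sketch of progressive (curriculum) training as described in the
# abstract. All names here are illustrative assumptions, not the paper's code.

from typing import Callable, List, Sequence


def curriculum_schedule(
    samples: Sequence[dict],
    difficulty: Callable[[dict], float],
    stage_fractions: Sequence[float] = (0.25, 0.5, 0.75, 1.0),
) -> List[List[dict]]:
    """Order samples from easy to hard and return growing training subsets."""
    ordered = sorted(samples, key=difficulty)
    return [ordered[: max(1, int(len(ordered) * f))] for f in stage_fractions]


# Temporal difficulty (SA-TCL-style assumption): queries composed of more
# sub-actions are harder to ground temporally.
def temporal_difficulty(sample: dict) -> float:
    return float(len(sample.get("sub_actions", [])))


# Spatial difficulty (CG-SCL-style assumption): more candidate tubelets in the
# scene ("congestion") make spatial grounding harder.
def spatial_difficulty(sample: dict) -> float:
    return float(sample.get("num_tubelets", 0))


if __name__ == "__main__":
    toy = [
        {"query": "runs then jumps", "sub_actions": ["run", "jump"], "num_tubelets": 3},
        {"query": "waves", "sub_actions": ["wave"], "num_tubelets": 1},
        {"query": "picks up a bag and walks away",
         "sub_actions": ["pick up", "walk"], "num_tubelets": 7},
    ]
    for stage, subset in enumerate(curriculum_schedule(toy, temporal_difficulty)):
        # In a real pipeline, each stage would fine-tune the grounding model on
        # `subset` before moving to the next, harder stage.
        print(f"stage {stage}: {len(subset)} samples")
```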
Cite
Text
Garg et al. "STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00321
Markdown
[Garg et al. "STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/garg2025cvpr-stpro/) doi:10.1109/CVPR52734.2025.00321
BibTeX
@inproceedings{garg2025cvpr-stpro,
title = {{STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding}},
author = {Garg, Aaryan and Kumar, Akash and Rawat, Yogesh S},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {3384-3394},
doi = {10.1109/CVPR52734.2025.00321},
url = {https://mlanthology.org/cvpr/2025/garg2025cvpr-stpro/}
}