WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding
Abstract
Spatio-temporal video grounding aims to localize the visual tube that corresponds to a language query. Existing techniques achieve such alignment by exploiting dense boundary and bounding box annotations, which can be prohibitively expensive. To bridge the gap, we investigate the weakly-supervised setting, where models learn from easily accessible video-language data without annotations. We identify that intra-sample spurious correlations among video-language components can be alleviated if the model captures the decomposed structures of video and language data. In this light, we propose a novel framework, namely WINNER, for hierarchical video-text understanding. WINNER first builds the language decomposition tree in a bottom-up manner, upon which the structural attention mechanism and top-down feature backtracking jointly build a multi-modal decomposition tree, permitting a hierarchical understanding of unstructured videos. The multi-modal decomposition tree serves as the basis for multi-hierarchy language-tube matching. A hierarchical contrastive learning objective is proposed to learn the multi-hierarchy correspondence and distinction with intra-sample and inter-sample video-text decomposition structures, achieving video-language decomposition structure alignment. Extensive experiments demonstrate the rationality of our design and show that it outperforms state-of-the-art weakly supervised methods and even some supervised methods.
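The abstract does not spell out the hierarchical contrastive objective, so the following is only a minimal sketch of the general idea under stated assumptions: a generic InfoNCE loss is computed at each level of the decomposition hierarchy and the per-level losses are summed. The function names (`info_nce`, `hierarchical_contrastive_loss`), the cosine similarity, the temperature, and the per-level weights are all hypothetical choices for illustration and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def info_nce(text_feats, video_feats, temperature=0.1):
    """Mean InfoNCE loss; text_feats[i] and video_feats[i] are the positive pair,
    every other video feature in the list acts as a negative."""
    total = 0.0
    for i, t in enumerate(text_feats):
        logits = [cosine(t, v) / temperature for v in video_feats]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(text_feats)

def hierarchical_contrastive_loss(levels, weights=None, temperature=0.1):
    """Weighted sum of per-level InfoNCE losses (an assumed aggregation).
    levels: list of (text_feats, video_feats) pairs, one per hierarchy level."""
    if weights is None:
        weights = [1.0] * len(levels)
    return sum(w * info_nce(t, v, temperature) for w, (t, v) in zip(weights, levels))

# Toy check with one level and two samples: matched pairs vs. swapped pairs.
aligned = [([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])]
shuffled = [([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])]
loss_aligned = hierarchical_contrastive_loss(aligned)
loss_shuffled = hierarchical_contrastive_loss(shuffled)
```

In this toy setup the matched pairing yields a much lower loss than the swapped pairing, which is the behavior a contrastive alignment objective is meant to reward.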
Cite
Text
Li et al. "WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.02211
Markdown
[Li et al. "WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/li2023cvpr-winner/) doi:10.1109/CVPR52729.2023.02211
BibTeX
@inproceedings{li2023cvpr-winner,
title = {{WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding}},
author = {Li, Mengze and Wang, Han and Zhang, Wenqiao and Miao, Jiaxu and Zhao, Zhou and Zhang, Shengyu and Ji, Wei and Wu, Fei},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2023},
pages = {23090-23099},
doi = {10.1109/CVPR52729.2023.02211},
url = {https://mlanthology.org/cvpr/2023/li2023cvpr-winner/}
}