WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding

Li, Mengze; Wang, Han; Zhang, Wenqiao; Miao, Jiaxu; Zhao, Zhou; Zhang, Shengyu; Ji, Wei; Wu, Fei

doi:10.1109/CVPR52729.2023.02211

CVPR 2023 pp. 23090-23099

Abstract

Spatio-temporal video grounding aims to localize the aligned visual tube corresponding to a language query. Existing techniques achieve such alignment by exploiting dense temporal-boundary and bounding-box annotations, which can be prohibitively expensive. To bridge the gap, we investigate the weakly-supervised setting, where models learn from easily accessible video-language data without annotations. We find that intra-sample spurious correlations among video-language components can be alleviated if the model captures the decomposed structures of video and language data. In this light, we propose a novel framework, namely WINNER, for hierarchical video-text understanding. WINNER first builds the language decomposition tree in a bottom-up manner, upon which the structural attention mechanism and top-down feature backtracking jointly build a multi-modal decomposition tree, permitting a hierarchical understanding of unstructured videos. The multi-modal decomposition tree serves as the basis for multi-hierarchy language-tube matching. A hierarchical contrastive learning objective is proposed to learn the multi-hierarchy correspondence and distinction across intra-sample and inter-sample video-text decomposition structures, achieving video-language decomposition structure alignment. Extensive experiments demonstrate the rationality of our design and its effectiveness, surpassing state-of-the-art weakly supervised methods and even some supervised methods.
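
As a rough illustration of what a hierarchical contrastive objective can look like, the sketch below sums a standard InfoNCE loss over several hierarchy levels. It is not the authors' implementation: the function names `info_nce` and `hierarchical_contrastive_loss`, the per-level weights, the temperature `tau`, and the convention that matched tube-query pairs lie on the diagonal of each similarity matrix are all assumptions made for this example.

```python
import math


def info_nce(sim, tau=0.1):
    """Mean InfoNCE loss over the rows of a square similarity matrix.

    sim[i][j] is the similarity between video item i and text item j.
    Assumption: the matched (positive) pair for row i is sim[i][i];
    every other entry in the row is treated as a negative.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / tau for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # -log softmax of the positive
    return total / len(sim)


def hierarchical_contrastive_loss(level_sims, weights=None, tau=0.1):
    """Weighted sum of InfoNCE losses, one per hierarchy level.

    level_sims is a list of similarity matrices, one for each level of
    a (hypothetical) decomposition tree, e.g. word, phrase, sentence.
    """
    if weights is None:
        weights = [1.0] * len(level_sims)
    return sum(w * info_nce(s, tau) for w, s in zip(weights, level_sims))


# Toy usage: two levels, batch of two video-text pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]      # positives score highest
uninformative = [[0.0, 0.0], [0.0, 0.0]]  # no signal at this level
loss = hierarchical_contrastive_loss([aligned, uninformative])
```

In this toy setup the level with no signal contributes about log 2 per sample, while the well-aligned level contributes almost nothing, so the total is dominated by the level where matching is still poor.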

Cite

Text

Li et al. "WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.02211

Markdown

[Li et al. "WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/li2023cvpr-winner/) doi:10.1109/CVPR52729.2023.02211

BibTeX

@inproceedings{li2023cvpr-winner,
  title     = {{WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding}},
  author    = {Li, Mengze and Wang, Han and Zhang, Wenqiao and Miao, Jiaxu and Zhao, Zhou and Zhang, Shengyu and Ji, Wei and Wu, Fei},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {23090-23099},
  doi       = {10.1109/CVPR52729.2023.02211},
  url       = {https://mlanthology.org/cvpr/2023/li2023cvpr-winner/}
}