Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval

Abstract

Vision-language cross-modal learning schemes have significantly improved the performance of text-video retrieval. The typical solution directly aligns global video-level and sentence-level features during learning, which ignores an intrinsic video-text relation: a text description corresponds to only a spatio-temporal part of a video. Hence, the matching process should consider both fine-grained spatial content and diverse temporal semantic events. To this end, we propose a text-video learning framework with progressive spatio-temporal prototype matching. Specifically, the vanilla matching process is decomposed into two complementary phases: object-phrase prototype matching and event-sentence prototype matching. In the object-phrase prototype matching phase, a spatial prototype generation mechanism predicts key patches or words, which are sparsely integrated into object or phrase prototypes. Importantly, optimizing the local alignment between object and phrase prototypes helps the model perceive spatial details. In the event-sentence prototype matching phase, we design a temporal prototype generation mechanism that associates intra-frame objects and models inter-frame temporal relations. These progressively generated event prototypes can reveal semantic diversity in videos for dynamic matching. In comprehensive experiments, our method consistently outperforms state-of-the-art methods on four video retrieval benchmarks.
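
The object-phrase phase described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how key patches or words are selected or how prototypes are compared, so the top-k selection, softmax weighting, cosine similarity, and max-then-mean matching below are all assumptions, and the names `sparse_prototype`, `cosine`, and `prototype_similarity` are hypothetical.

```python
import math


def sparse_prototype(features, scores, k):
    """Aggregate the k highest-scoring tokens (patches or words) into one prototype.

    Assumption: 'sparse integration' is modeled as top-k selection followed by
    a softmax-weighted sum; the paper may use a different mechanism.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)  # subtract the max for numerical stability
    weights = [math.exp(scores[i] - peak) for i in top]
    total = sum(weights)
    dim = len(features[0])
    return [
        sum(w / total * features[i][d] for w, i in zip(weights, top))
        for d in range(dim)
    ]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prototype_similarity(video_protos, text_protos):
    """Mean, over text prototypes, of the best-matching video prototype.

    Assumption: a simple max-then-mean rule stands in for the paper's matching.
    """
    best = [max(cosine(t, v) for v in video_protos) for t in text_protos]
    return sum(best) / len(best)


# Toy example: three 2-D patch features with predicted importance scores.
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
patch_scores = [2.0, 2.0, -5.0]
object_proto = sparse_prototype(patches, patch_scores, k=2)
```

In the toy example, the low-scoring third patch is excluded from the prototype entirely, which is the sense in which the integration is sparse. The event-sentence phase could reuse the same matching rule on frame-level prototypes, though the abstract does not say whether it does.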

Cite

Text

Li et al. "Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00379

Markdown

[Li et al. "Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/li2023iccv-progressive/) doi:10.1109/ICCV51070.2023.00379

BibTeX

@inproceedings{li2023iccv-progressive,
  title     = {{Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval}},
  author    = {Li, Pandeng and Xie, Chen-Wei and Zhao, Liming and Xie, Hongtao and Ge, Jiannan and Zheng, Yun and Zhao, Deli and Zhang, Yongdong},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {4100--4110},
  doi       = {10.1109/ICCV51070.2023.00379},
  url       = {https://mlanthology.org/iccv/2023/li2023iccv-progressive/}
}