Video-Text Pre-Training with Learned Regions for Retrieval

Yan, Rui; Shou, Mike Zheng; Ge, Yixiao; Wang, Jinpeng; Lin, Xudong; Cai, Guanyu; Tang, Jinhui

doi:10.1609/AAAI.V37I3.25414

AAAI 2023 pp. 3100-3108

Abstract

Video-text pre-training aims to learn transferable representations from large-scale video-text pairs by aligning the semantics of visual and textual information. State-of-the-art approaches extract visual features from raw pixels in an end-to-end fashion. However, these methods operate directly at the frame level and thus overlook the spatio-temporal structure of objects in video, which has a strong synergy with nouns in textual descriptions. In this work, we propose a simple yet effective module for video-text representation learning, namely RegionLearner, which takes the structure of objects into account during pre-training on large-scale video-text pairs. Given a video, our module (1) first quantizes continuous visual features by clustering patch features according to content similarity, then (2) generates learnable masks to aggregate fragmentary features into regions with complete semantics, and finally (3) models the spatio-temporal dependencies between different semantic regions. In contrast to using off-the-shelf object detectors, our proposed module does not require explicit supervision and is much more computationally efficient. We pre-train the proposed approach on the public WebVid2M and CC3M datasets. Extensive evaluations on four downstream video-text retrieval benchmarks demonstrate the effectiveness of RegionLearner.
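
The three steps listed in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify how the clusters, masks, or dependencies are computed, so the sketch assumes a nearest-neighbour codebook lookup for step (1), a single linear projection (the hypothetical `mask_proj`) producing soft masks for step (2), and plain scaled dot-product self-attention without learned weights for step (3). All function names, parameter names, and shapes are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def quantize(patches, codebook):
    # Step (1), assumed form: snap each patch feature to its nearest
    # codebook entry so similar patches share one cluster vector.
    # patches: (P, D), codebook: (K, D)
    dist2 = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dist2.argmin(axis=1)          # (P,) cluster index per patch
    return codebook[idx], idx           # (P, D) quantized features


def aggregate_regions(quantized, mask_proj):
    # Step (2), assumed form: a linear projection yields R soft masks over
    # the P patches; each mask pools patch features into one region vector.
    # mask_proj: (D, R) stands in for the learnable parameters.
    logits = quantized @ mask_proj      # (P, R)
    masks = softmax(logits, axis=0)     # each mask sums to 1 over patches
    return masks.T @ quantized, masks   # (R, D) region features


def region_attention(regions):
    # Step (3), assumed form: self-attention among region vectors
    # (no learned query/key/value projections in this sketch).
    d = regions.shape[-1]
    attn = softmax(regions @ regions.T / np.sqrt(d), axis=-1)  # (R, R)
    return attn @ regions               # (R, D)


def region_learner_sketch(patches, codebook, mask_proj):
    quantized, idx = quantize(patches, codebook)
    regions, masks = aggregate_regions(quantized, mask_proj)
    return region_attention(regions), idx, masks


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patches = rng.normal(size=(4 * 16, 32))   # 4 frames x 16 patches, D = 32
    codebook = rng.normal(size=(8, 32))       # K = 8 clusters
    mask_proj = rng.normal(size=(32, 4))      # R = 4 regions
    out, idx, masks = region_learner_sketch(patches, codebook, mask_proj)
    print(out.shape)  # (4, 32)
```

Patches from all frames are flattened into one axis here, so each soft mask can span both space and time. In a trained model the codebook and mask parameters would be learned jointly with the video-text objective rather than sampled at random.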

Cite

Text

Yan et al. "Video-Text Pre-Training with Learned Regions for Retrieval." AAAI Conference on Artificial Intelligence, 2023. doi:10.1609/AAAI.V37I3.25414

Markdown

[Yan et al. "Video-Text Pre-Training with Learned Regions for Retrieval." AAAI Conference on Artificial Intelligence, 2023.](https://mlanthology.org/aaai/2023/yan2023aaai-video/) doi:10.1609/AAAI.V37I3.25414

BibTeX

@inproceedings{yan2023aaai-video,
  title     = {{Video-Text Pre-Training with Learned Regions for Retrieval}},
  author    = {Yan, Rui and Shou, Mike Zheng and Ge, Yixiao and Wang, Jinpeng and Lin, Xudong and Cai, Guanyu and Tang, Jinhui},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2023},
  pages     = {3100-3108},
  doi       = {10.1609/AAAI.V37I3.25414},
  url       = {https://mlanthology.org/aaai/2023/yan2023aaai-video/}
}