HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval

Liu, Song; Fan, Haoqi; Qian, Shengsheng; Chen, Yiru; Ding, Wenkui; Wang, Zhongyuan

doi:10.1109/ICCV48922.2021.01170

ICCV 2021 pp. 11915-11925

Abstract

Video-text retrieval has become an active research topic as multimedia data on the internet grows. Transformers for video-text learning have attracted increasing attention due to their promising performance. However, existing cross-modal transformer approaches typically suffer from two major limitations: 1) they make limited use of the fact that different transformer layers have different feature characteristics; 2) the end-to-end training mechanism limits negative-sample interactions to those within a mini-batch. In this paper, we propose a novel approach named Hierarchical Transformer (HiT) for video-text retrieval. HiT performs Hierarchical Cross-modal Contrastive Matching at both the feature level and the semantic level, achieving multi-view and comprehensive retrieval results. Moreover, inspired by MoCo, we propose Momentum Cross-modal Contrast for cross-modal learning to enable large-scale negative sample interactions on the fly, which contributes to the generation of more precise and discriminative representations. Experimental results on three major video-text retrieval benchmark datasets demonstrate the advantages of our method.
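
The momentum-contrast idea mentioned in the abstract can be illustrated with a minimal, framework-free sketch. It follows the general MoCo recipe (a key encoder updated as an exponential moving average of the query encoder, plus a queue of past keys used as negatives) and is not the authors' implementation. The function names, the momentum value, the temperature, and the queue size below are all illustrative assumptions, and embeddings are plain Python lists rather than real encoder outputs.

```python
import math
import random
from collections import deque


def normalize(vec):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def momentum_update(query_params, key_params, m=0.999):
    """MoCo-style EMA update: key <- m * key + (1 - m) * query.

    The momentum m = 0.999 is an assumed value, not taken from the paper.
    """
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]


def info_nce(query, positive_key, negative_queue, temperature=0.07):
    """InfoNCE loss for one query: one positive key plus queued negatives.

    Returns -log softmax(positive), computed with a stable log-sum-exp.
    The temperature is an assumed value.
    """
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, neg) / temperature for neg in negative_queue]
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum_exp - logits[0]


# Hypothetical usage: a video embedding queries a queue of text-key embeddings.
random.seed(0)
dim = 8
text_queue = deque(maxlen=4096)  # assumed queue size; oldest keys drop out
for _ in range(16):
    text_queue.append(normalize([random.gauss(0, 1) for _ in range(dim)]))

video_query = normalize([random.gauss(0, 1) for _ in range(dim)])
text_key = list(video_query)  # stand-in for a perfectly matched caption
loss = info_nce(video_query, text_key, text_queue)
text_queue.append(text_key)  # the new key becomes a negative for later queries
```

In a cross-modal setting the same loss would presumably be applied in both directions (video queries against queued text keys, and text queries against queued video keys), so the number of negatives is set by the queue length rather than by the mini-batch size.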

Cite

Text

Liu et al. "HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval." International Conference on Computer Vision, 2021. doi:10.1109/ICCV48922.2021.01170

Markdown

[Liu et al. "HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/liu2021iccv-hit/) doi:10.1109/ICCV48922.2021.01170

BibTeX

@inproceedings{liu2021iccv-hit,
  title     = {{HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval}},
  author    = {Liu, Song and Fan, Haoqi and Qian, Shengsheng and Chen, Yiru and Ding, Wenkui and Wang, Zhongyuan},
  booktitle = {International Conference on Computer Vision},
  year      = {2021},
  pages     = {11915-11925},
  doi       = {10.1109/ICCV48922.2021.01170},
  url       = {https://mlanthology.org/iccv/2021/liu2021iccv-hit/}
}