HiVLP: Hierarchical Interactive Video-Language Pre-Training

Abstract

Video-Language Pre-training (VLP) has become one of the most popular research topics in deep learning. However, compared to image-language pre-training, VLP has lagged far behind due to the lack of large amounts of video-text pairs. In this work, we train a VLP model with a hybrid of image-text and video-text pairs, which significantly outperforms pre-training with only video-text pairs. In addition, existing methods usually model cross-modal interaction using cross-attention between single-scale visual tokens and textual tokens. These visual features are either low-resolution, lacking fine-grained information, or high-resolution, lacking high-level semantics. To address this issue, we propose Hierarchical Interactive Video-Language Pre-training (HiVLP), which efficiently uses a hierarchical visual feature group for multi-modal cross-attention during pre-training. In the hierarchical framework, low-resolution features are learned with a focus on global, high-level semantic information, while high-resolution features carry fine-grained details. As a result, HiVLP can effectively learn both global and fine-grained representations to achieve better alignment between video and text inputs. Furthermore, we design a hierarchical multi-scale vision contrastive loss for self-supervised learning to boost the interaction between them. Experimental results show that HiVLP establishes new state-of-the-art results in three downstream tasks: text-video retrieval, video-text retrieval, and video captioning.

Cite

Text

Shao et al. "HiVLP: Hierarchical Interactive Video-Language Pre-Training." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01265

Markdown

[Shao et al. "HiVLP: Hierarchical Interactive Video-Language Pre-Training." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/shao2023iccv-hivlp/) doi:10.1109/ICCV51070.2023.01265

BibTeX

@inproceedings{shao2023iccv-hivlp,
  title     = {{HiVLP: Hierarchical Interactive Video-Language Pre-Training}},
  author    = {Shao, Bin and Liu, Jianzhuang and Pei, Renjing and Xu, Songcen and Dai, Peng and Lu, Juwei and Li, Weimian and Yan, Youliang},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {13756--13766},
  doi       = {10.1109/ICCV51070.2023.01265},
  url       = {https://mlanthology.org/iccv/2023/shao2023iccv-hivlp/}
}