Learning Semantic Alignment with Global Modality Reconstruction for Video-Language Pre-Training Towards Retrieval

Abstract

Video-language pre-training for text-based video retrieval is vitally important. Previous pre-training methods suffer from semantic misalignment because they focus on critical token alignment while ignoring sequence-level alignment. To alleviate this problem, we propose a video-language pre-training framework, termed video-language pre-training For lEarning sEmantic aLignments (FEEL), which learns semantic alignments at the sequence level. Specifically, global modality reconstruction and a cross-modal self-contrasting method are utilized to better learn alignments at the sequence level. Extensive experimental results demonstrate the effectiveness of FEEL on text-based video retrieval and text-based video corpus moment retrieval.
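The abstract only names its two objectives without specifying them. As a rough, hypothetical sketch (not the paper's formulation), sequence-level cross-modal alignment of the kind the self-contrasting method suggests is often implemented as a symmetric InfoNCE loss over pooled video and text sequence embeddings; the function name, pooling choice, and temperature below are illustrative assumptions:

  import torch
  import torch.nn.functional as F

  def sequence_level_contrastive_loss(video_emb, text_emb, temperature=0.07):
      """Symmetric InfoNCE over sequence-level embeddings.

      video_emb, text_emb: (batch, dim) pooled representations of each
      video and text sequence (e.g., mean-pooled token features).
      Matched video-text pairs share a batch index.
      """
      v = F.normalize(video_emb, dim=-1)
      t = F.normalize(text_emb, dim=-1)
      logits = v @ t.T / temperature  # (batch, batch) similarity matrix
      targets = torch.arange(v.size(0), device=v.device)
      # Matched pairs lie on the diagonal; all other pairings in the
      # batch act as negatives, pulling aligned sequences together.
      loss_v2t = F.cross_entropy(logits, targets)
      loss_t2v = F.cross_entropy(logits.T, targets)
      return (loss_v2t + loss_t2v) / 2

Contrasting whole-sequence embeddings, rather than individual tokens, is what makes such a loss operate at the sequence level.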

Cite

Text

Li et al. "Learning Semantic Alignment with Global Modality Reconstruction for Video-Language Pre-Training Towards Retrieval." AAAI Conference on Artificial Intelligence, 2023. doi:10.1609/aaai.v37i1.25222

Markdown

[Li et al. "Learning Semantic Alignment with Global Modality Reconstruction for Video-Language Pre-Training Towards Retrieval." AAAI Conference on Artificial Intelligence, 2023.](https://mlanthology.org/aaai/2023/li2023aaai-learning-e/) doi:10.1609/aaai.v37i1.25222

BibTeX

@inproceedings{li2023aaai-learning-e,
  title     = {{Learning Semantic Alignment with Global Modality Reconstruction for Video-Language Pre-Training Towards Retrieval}},
  author    = {Li, Mingchao and Shi, Xiaoming and Leng, Haitao and Zhou, Wei and Zheng, Hai-Tao and Zhang, Kuncai},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2023},
  pages     = {1377--1385},
  doi       = {10.1609/aaai.v37i1.25222},
  url       = {https://mlanthology.org/aaai/2023/li2023aaai-learning-e/}
}