Semi-Supervised Video Paragraph Grounding with Contrastive Encoder
Abstract
Video event grounding aims to retrieve the most relevant moments from an untrimmed video given a natural language query. Most previous works focus on Video Sentence Grounding (VSG), which localizes a single moment with a sentence query. Recently, researchers extended this task to Video Paragraph Grounding (VPG), which retrieves multiple events with a paragraph. However, we find that existing VPG methods model context poorly and rely heavily on video-paragraph annotations. To tackle this problem, we propose a novel VPG method termed Semi-supervised Video-Paragraph TRansformer (SVPTR), which more effectively exploits contextual information in paragraphs and significantly reduces the dependency on annotated data. Our SVPTR method consists of two key components: (1) a base model, VPTR, that learns video-paragraph alignment with contrastive encoders and addresses the lack of sentence-level contextual interactions, and (2) a semi-supervised learning framework with multimodal feature perturbations that reduces the amount of annotated training data required. We evaluate our model on three widely used video grounding datasets, i.e., ActivityNet-Caption, Charades-CD-OOD, and TACoS. The experimental results show that our SVPTR method establishes new state-of-the-art performance on all three datasets. Even with fewer annotations, it achieves competitive results compared with recent VPG methods.
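The contrastive video-paragraph alignment mentioned above can be illustrated with a generic symmetric InfoNCE-style objective: paired video-event and sentence embeddings are pulled together while mismatched pairs in the batch are pushed apart. This is a minimal NumPy sketch of that general technique, not the paper's exact loss; the function name, temperature value, and embedding shapes are illustrative assumptions.

```python
import numpy as np

def contrastive_alignment_loss(video_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss over N paired embeddings.

    video_feats, text_feats: arrays of shape (N, D); row i of each
    array is assumed to describe the same event. (Illustrative sketch;
    the paper's actual formulation may differ.)
    """
    # L2-normalize so dot products become cosine similarities.
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature  # (N, N); diagonal = matched pairs

    def cross_entropy_diag(l):
        # Softmax cross-entropy where the correct class for row i is column i.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Symmetric: video-to-text retrieval plus text-to-video retrieval.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

With well-aligned pairs the loss approaches zero; shuffling the pairing drives it up, which is the signal the encoders are trained on.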
Cite
Text
Jiang et al. "Semi-Supervised Video Paragraph Grounding with Contrastive Encoder." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.00250
Markdown
[Jiang et al. "Semi-Supervised Video Paragraph Grounding with Contrastive Encoder." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/jiang2022cvpr-semisupervised/) doi:10.1109/CVPR52688.2022.00250
BibTeX
@inproceedings{jiang2022cvpr-semisupervised,
title = {{Semi-Supervised Video Paragraph Grounding with Contrastive Encoder}},
author = {Jiang, Xun and Xu, Xing and Zhang, Jingran and Shen, Fumin and Cao, Zuo and Shen, Heng Tao},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2022},
pages = {2466--2475},
doi = {10.1109/CVPR52688.2022.00250},
url = {https://mlanthology.org/cvpr/2022/jiang2022cvpr-semisupervised/}
}