Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding
Abstract
Video Paragraph Grounding (VPG) is an emerging task in video-language understanding, which aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video. However, existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire. In this work, we introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations. Different from previous weakly-supervised grounding frameworks based on multiple instance learning or reconstruction learning for two-stage candidate ranking, we propose a novel siamese learning framework that jointly learns the cross-modal feature alignment and temporal coordinate regression without timestamp labels to achieve concise one-stage localization for WSVPG. Specifically, we devise a Siamese Grounding TRansformer (SiamGTR) consisting of two weight-sharing branches for learning complementary supervision. An Augmentation Branch is utilized for directly regressing the temporal boundaries of a complete paragraph within a pseudo video, and an Inference Branch is designed to capture the order-guided feature correspondence for localizing multiple sentences in a normal video. We demonstrate by extensive experiments that our paradigm has superior practicability and flexibility to achieve efficient weakly-supervised or semi-supervised learning, outperforming state-of-the-art methods trained with the same or stronger supervision.
Cite
Text
Tan et al. "Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01288
Markdown
[Tan et al. "Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/tan2024cvpr-siamese/) doi:10.1109/CVPR52733.2024.01288
BibTeX
@inproceedings{tan2024cvpr-siamese,
title = {{Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding}},
author = {Tan, Chaolei and Lai, Jianhuang and Zheng, Wei-Shi and Hu, Jian-Fang},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {13569-13580},
doi = {10.1109/CVPR52733.2024.01288},
url = {https://mlanthology.org/cvpr/2024/tan2024cvpr-siamese/}
}