Exploring Denoised Cross-Video Contrast for Weakly-Supervised Temporal Action Localization

Jingjing Li, Tianyu Yang, Wei Ji, Jue Wang, Li Cheng

CVPR 2022 pp. 19914-19924

doi:10.1109/CVPR52688.2022.01929

Abstract

Weakly-supervised temporal action localization aims to localize actions in untrimmed videos with only video-level labels. Most existing methods address this problem with a "localization-by-classification" pipeline that localizes action regions based on snippet-wise classification sequences. Snippet-wise classifications are unfortunately error-prone due to the sparsity of video-level labels. Inspired by recent success in unsupervised contrastive representation learning, we propose a novel denoised cross-video contrastive algorithm, aiming to enhance the feature discrimination ability of video snippets for accurate temporal action localization in the weakly-supervised setting. This is enabled by three key designs: 1) an effective pseudo-label denoising module to alleviate the side effects caused by noisy contrastive features, 2) an efficient region-level feature contrast strategy with a region-level memory bank to capture "global" contrast across the entire dataset, and 3) a diverse contrastive learning strategy to enable action-background separation as well as intra-class compactness and inter-class separability. Extensive experiments on THUMOS14 and ActivityNet v1.3 demonstrate the superior performance of our approach.
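To make the "localization-by-classification" idea in the abstract concrete, here is a minimal sketch of how action regions might be read off a snippet-wise classification sequence. It is not the authors' implementation: the function name `localize_by_threshold`, the fixed threshold of 0.5, and the example scores are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def localize_by_threshold(
    scores: List[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Group consecutive snippets scoring above `threshold` into segments.

    Hypothetical helper for illustration only. Returns (start, end)
    snippet indices, with `end` exclusive.
    """
    segments: List[Tuple[int, int]] = []
    start = None  # index where the current above-threshold run began
    for i, score in enumerate(scores):
        if score > threshold and start is None:
            start = i  # a new candidate action region opens here
        elif score <= threshold and start is not None:
            segments.append((start, i))  # the region closes before snippet i
            start = None
    if start is not None:
        # The sequence ended while a region was still open.
        segments.append((start, len(scores)))
    return segments


# Made-up per-snippet scores for a single action class.
snippet_scores = [0.1, 0.8, 0.9, 0.2, 0.7, 0.6, 0.95]
print(localize_by_threshold(snippet_scores))  # [(1, 3), (4, 7)]
```

In this toy sequence the low score at index 3 splits what could be one action into two regions. That sensitivity to individual snippet scores is one way to picture why error-prone snippet-wise classifications hurt localization, which is the problem the abstract's contrastive approach targets.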

Cite

Text

Li et al. "Exploring Denoised Cross-Video Contrast for Weakly-Supervised Temporal Action Localization." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01929

Markdown

[Li et al. "Exploring Denoised Cross-Video Contrast for Weakly-Supervised Temporal Action Localization." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/li2022cvpr-exploring/) doi:10.1109/CVPR52688.2022.01929

BibTeX

@inproceedings{li2022cvpr-exploring,
  title     = {{Exploring Denoised Cross-Video Contrast for Weakly-Supervised Temporal Action Localization}},
  author    = {Li, Jingjing and Yang, Tianyu and Ji, Wei and Wang, Jue and Cheng, Li},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {19914--19924},
  doi       = {10.1109/CVPR52688.2022.01929},
  url       = {https://mlanthology.org/cvpr/2022/li2022cvpr-exploring/}
}