Semi-Supervised Temporal Action Detection with Proposal-Free Masking
Abstract
Existing temporal action detection (TAD) methods rely on a large amount of training data with segment-level annotations. Collecting and annotating such a training set is thus highly expensive and unscalable. Semi-supervised TAD (SS-TAD) alleviates this problem by leveraging unlabeled videos freely available at scale. However, SS-TAD is also a much more challenging problem than supervised TAD, and consequently much under-studied. Prior SS-TAD methods directly combine an existing proposal-based TAD method and an SSL method. Due to their sequential localization (e.g., proposal generation) and classification design, they are prone to proposal error propagation. To overcome this limitation, in this work we propose a novel Semi-supervised Temporal action detection model based on PropOsal-free Temporal mask (SPOT) with a parallel localization (mask generation) and classification architecture. Such a novel design effectively eliminates the dependence between localization and classification by cutting off the route for error propagation in-between. We further introduce an interaction mechanism between classification and localization for prediction refinement, and a new pretext task for self-supervised model pre-training. Extensive experiments on two standard benchmarks show that our SPOT outperforms state-of-the-art alternatives, often by a large margin. The PyTorch implementation of SPOT is available at https://github.com/sauradip/SPOT
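To make the parallel design concrete, the following is a toy sketch only, not the SPOT implementation. It assumes per-snippet feature vectors and two hypothetical linear heads (`classify` and `foreground_mask`) that each read the same features independently, so neither head consumes the other's output. The weights, threshold, per-snippet mask simplification, and the `decode` step are all illustrative assumptions that do not come from the paper.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(vec, weights, bias):
    # weights holds one row per output unit
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def classify(features, weights, bias):
    # Classification head: class probabilities for every snippet.
    return [softmax(linear(f, weights, bias)) for f in features]

def foreground_mask(features, weights, bias):
    # Localization head: foreground probability for every snippet.
    # It reads the raw features, never the classifier's output.
    return [sigmoid(linear(f, [weights], [bias])[0]) for f in features]

def decode(class_probs, fg_probs, threshold=0.5):
    # Hypothetical decoding: contiguous foreground runs become segments,
    # labeled by the class with the highest mean probability in the run.
    segments, start = [], None
    num_classes = len(class_probs[0])
    for t, p in enumerate(fg_probs + [0.0]):  # sentinel closes a trailing run
        if p > threshold and start is None:
            start = t
        elif p <= threshold and start is not None:
            mean = [sum(class_probs[i][c] for i in range(start, t)) / (t - start)
                    for c in range(num_classes)]
            label = max(range(num_classes), key=mean.__getitem__)
            segments.append((start, t, label))
            start = None
    return segments

# Six snippets with two-dimensional toy features (made-up values).
features = [[0, 0], [2, 0], [2, 0], [0, 0], [0, 2], [0, 0]]
class_probs = classify(features, [[1, 0], [0, 1]], [0, 0])
fg_probs = foreground_mask(features, [2, 2], -2)
segments = decode(class_probs, fg_probs)
print(segments)
```

Each tuple is `(start, end, label)` with an exclusive end index. Because both heads are computed from `features` directly, a mistake in the mask cannot change the class probabilities, which is the property the abstract contrasts with sequential proposal-then-classify pipelines.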
Cite
Text
Nag et al. "Semi-Supervised Temporal Action Detection with Proposal-Free Masking." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-20062-5_38
Markdown
[Nag et al. "Semi-Supervised Temporal Action Detection with Proposal-Free Masking." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/nag2022eccv-semisupervised/) doi:10.1007/978-3-031-20062-5_38
BibTeX
@inproceedings{nag2022eccv-semisupervised,
title = {{Semi-Supervised Temporal Action Detection with Proposal-Free Masking}},
author = {Nag, Sauradip and Zhu, Xiatian and Song, Yi-Zhe and Xiang, Tao},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2022},
doi = {10.1007/978-3-031-20062-5_38},
url = {https://mlanthology.org/eccv/2022/nag2022eccv-semisupervised/}
}