Boosting Semi-Supervised Video Action Detection with Temporal Context

Abstract

This paper studies semi-supervised learning of video action detection (VAD), which assumes that only a small portion of the training videos are labeled while the rest remain unlabeled. Existing semi-supervised methods for VAD mainly focus on leveraging the spatial context of unlabeled videos without exploring their temporal context. To resolve this, we present a novel semi-supervised learning framework that effectively incorporates spatio-temporal context during training. We first introduce a new augmentation strategy, called temporal cross-view augmentation, to achieve robust representations across clips that depict the same action but are not aligned on the time axis. We also propose a new context fusion method, called global-local context fusion, that enhances the features of each frame by incorporating those of other frames within a clip; by actively leveraging the spatio-temporal context of videos, this method leads to significant performance improvement. Our framework was evaluated on UCF101-24 and JHMDB-21, where it outperformed all existing methods in every evaluation setting.

Cite

Text

Kwon et al. "Boosting Semi-Supervised Video Action Detection with Temporal Context." Winter Conference on Applications of Computer Vision, 2025.

Markdown

[Kwon et al. "Boosting Semi-Supervised Video Action Detection with Temporal Context." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/kwon2025wacv-boosting/)

BibTeX

@inproceedings{kwon2025wacv-boosting,
  title     = {{Boosting Semi-Supervised Video Action Detection with Temporal Context}},
  author    = {Kwon, Donghyeon and Kim, Inho and Kwak, Suha},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2025},
  pages     = {847-858},
  url       = {https://mlanthology.org/wacv/2025/kwon2025wacv-boosting/}
}