Stable Mean Teacher for Semi-Supervised Video Action Detection

Abstract

In this work, we focus on semi-supervised learning for video action detection. Video action detection requires spatio-temporal localization in addition to classification, and a limited number of labels makes the model prone to unreliable predictions. We present Stable Mean Teacher, a simple end-to-end student-teacher framework that benefits from improved and temporally consistent pseudo labels. It relies on a novel ErrOr Recovery (EoR) module, which learns from the student's mistakes on labeled samples and transfers this knowledge to the teacher to improve pseudo labels for unlabeled samples. Moreover, existing spatio-temporal losses do not take temporal coherency into account and are prone to temporal inconsistencies. To overcome this, we present Difference of Pixels (DoP), a simple and novel constraint focused on temporal consistency, which leads to coherent temporal detections. We evaluate our approach on four spatio-temporal detection benchmarks: UCF101-24, JHMDB21, AVA, and YouTube-VOS. Our approach outperforms the supervised baselines for action detection by an average margin of 23.5% on UCF101-24, 16% on JHMDB21, and 3.3% on AVA. Using merely 10% and 20% of the data on UCF101-24 and JHMDB21, respectively, it achieves performance competitive with the supervised baseline trained on 100% of the annotations. We further evaluate it on AVA to test scaling to large datasets and on YouTube-VOS for video object segmentation, demonstrating that it generalizes to other tasks in the video domain.
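As a rough illustration of two ideas named in the abstract, the sketch below shows a generic mean-teacher weight update and a temporal-difference consistency term. It is not the authors' implementation: the abstract does not give formulas, so the function names `ema_update`, `temporal_difference`, and `dop_style_loss`, the decay value, and the squared-error form of the temporal term are all assumptions made for illustration. The EoR module is not modeled here.

```python
import numpy as np


def ema_update(teacher, student, decay=0.99):
    """Generic mean-teacher step: move teacher weights toward the student
    with an exponential moving average. The decay value is an assumption."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}


def temporal_difference(maps):
    """Frame-to-frame difference of a (T, H, W) stack of localization maps."""
    return maps[1:] - maps[:-1]


def dop_style_loss(student_maps, teacher_maps):
    """Hypothetical temporal-consistency term: mean squared error between the
    temporal differences of student and teacher maps. The paper's actual
    Difference of Pixels constraint may be defined differently."""
    diff = temporal_difference(student_maps) - temporal_difference(teacher_maps)
    return float(np.mean(diff ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = {"w": np.ones(3)}
    student = {"w": np.zeros(3)}
    teacher = ema_update(teacher, student, decay=0.9)

    student_maps = rng.random((8, 4, 4))   # 8 frames of 4x4 localization maps
    teacher_maps = rng.random((8, 4, 4))
    print(teacher["w"], dop_style_loss(student_maps, teacher_maps))
```

Under this assumed form, the term compares how predictions change between consecutive frames rather than the predictions themselves, so a constant offset between student and teacher maps is not penalized; a per-frame localization loss would have to cover that separately.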

Cite

Text

Kumar et al. "Stable Mean Teacher for Semi-Supervised Video Action Detection." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I4.32465

Markdown

[Kumar et al. "Stable Mean Teacher for Semi-Supervised Video Action Detection." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/kumar2025aaai-stable/) doi:10.1609/AAAI.V39I4.32465

BibTeX

@inproceedings{kumar2025aaai-stable,
  title     = {{Stable Mean Teacher for Semi-Supervised Video Action Detection}},
  author    = {Kumar, Akash and Mitra, Sirshapan and Rawat, Yogesh Singh},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {4419--4427},
  doi       = {10.1609/AAAI.V39I4.32465},
  url       = {https://mlanthology.org/aaai/2025/kumar2025aaai-stable/}
}