Stable Mean Teacher for Semi-Supervised Video Action Detection
Abstract
In this work, we focus on semi-supervised learning for video action detection. Video action detection requires spatio-temporal localization in addition to classification, and a limited number of labels makes the model prone to unreliable predictions. We present Stable Mean Teacher, a simple end-to-end student-teacher framework that benefits from improved and temporally consistent pseudo labels. It relies on a novel ErrOr Recovery (EoR) module, which learns from the student's mistakes on labeled samples and transfers this knowledge to the teacher to improve pseudo labels for unlabeled samples. Moreover, existing spatio-temporal losses do not take temporal coherency into account and are prone to temporal inconsistencies. To overcome this, we present Difference of Pixels (DoP), a simple and novel constraint focused on temporal consistency, which leads to coherent temporal detections. We evaluate our approach on four spatio-temporal detection benchmarks: UCF101-24, JHMDB21, AVA, and YouTube-VOS. Our approach outperforms the supervised baselines for action detection by an average margin of 23.5% on UCF101-24, 16% on JHMDB21, and 3.3% on AVA. Using merely 10% and 20% of the data, it performs competitively with the supervised baseline trained on 100% of the annotations on UCF101-24 and JHMDB21, respectively. We further evaluate it on AVA to test scaling to large datasets and on YouTube-VOS for video object segmentation, demonstrating that it generalizes to other tasks in the video domain.
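The abstract names two ingredients that a small example can make concrete: a mean-teacher weight update and a consistency term computed on frame-to-frame differences. The sketch below is illustrative only and is not the authors' implementation. The function names, the flattened-list representation of per-pixel maps, the decay value, and the choice of mean squared error are all assumptions made for this example; the abstract does not give the exact form of the DoP constraint or of the EoR module, and the EoR module is not modeled here.

```python
from typing import List

Frame = List[float]  # one flattened per-pixel localization map (assumed layout)
Clip = List[Frame]   # frames ordered in time


def ema_update(teacher: List[float], student: List[float], decay: float = 0.99) -> List[float]:
    """Generic mean-teacher step: teacher weights follow an exponential
    moving average of the student weights. The decay value is an assumption."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def temporal_differences(clip: Clip) -> Clip:
    """Pixel-wise difference between each pair of consecutive frames."""
    return [[b - a for a, b in zip(prev, nxt)] for prev, nxt in zip(clip, clip[1:])]


def temporal_consistency_loss(student_clip: Clip, teacher_clip: Clip) -> float:
    """Hypothetical stand-in for a difference-of-pixels style term: mean squared
    error between the temporal differences of two predictions."""
    s_diff = temporal_differences(student_clip)
    t_diff = temporal_differences(teacher_clip)
    errors = [(s - t) ** 2 for sf, tf in zip(s_diff, t_diff) for s, t in zip(sf, tf)]
    if not errors:  # fewer than two frames: no temporal signal to compare
        return 0.0
    return sum(errors) / len(errors)
```

Under these assumptions, the loss ignores a constant offset between the two predictions and penalizes only disagreement in how they change from one frame to the next, which is the sense in which such a term targets temporal coherence rather than per-frame accuracy.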
Cite
Text
Kumar et al. "Stable Mean Teacher for Semi-Supervised Video Action Detection." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I4.32465
Markdown
[Kumar et al. "Stable Mean Teacher for Semi-Supervised Video Action Detection." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/kumar2025aaai-stable/) doi:10.1609/AAAI.V39I4.32465
BibTeX
@inproceedings{kumar2025aaai-stable,
title = {{Stable Mean Teacher for Semi-Supervised Video Action Detection}},
author = {Kumar, Akash and Mitra, Sirshapan and Rawat, Yogesh Singh},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {4419-4427},
doi = {10.1609/AAAI.V39I4.32465},
url = {https://mlanthology.org/aaai/2025/kumar2025aaai-stable/}
}