SpatioTemporal Learning for Human Pose Estimation in Sparsely-Labeled Videos

Jiao, Yingying; Wang, Zhigang; Wu, Sifan; Fan, Shaojing; Liu, Zhenguang; Xu, Zhuoyue; Wu, Zheqi

doi:10.1609/AAAI.V39I4.32429

AAAI 2025 pp. 4093-4101

Abstract

Human pose estimation in videos remains a challenge, largely due to the reliance on extensive manual annotation of large datasets, which is expensive and labor-intensive. Furthermore, existing approaches often struggle to capture long-range temporal dependencies and overlook the complementary relationship between temporal pose heatmaps and visual features. To address these limitations, we introduce STDPose, a novel framework that enhances human pose estimation by learning spatiotemporal dynamics in sparsely-labeled videos. STDPose incorporates two key innovations: 1) A novel Dynamic-Aware Mask to capture long-range motion context, allowing for a nuanced understanding of pose changes. 2) A system for encoding and aggregating spatiotemporal representations and motion dynamics to effectively model spatiotemporal relationships, improving the accuracy and robustness of pose estimation. STDPose establishes a new performance benchmark for both video pose propagation (i.e., propagating pose annotations from labeled frames to unlabeled frames) and pose estimation tasks, across three large-scale evaluation datasets. Additionally, utilizing pseudo-labels generated by pose propagation, STDPose achieves competitive performance with only 26.7% labeled data.

Cite

Text

Jiao et al. "SpatioTemporal Learning for Human Pose Estimation in Sparsely-Labeled Videos." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I4.32429

Markdown

[Jiao et al. "SpatioTemporal Learning for Human Pose Estimation in Sparsely-Labeled Videos." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/jiao2025aaai-spatiotemporal/) doi:10.1609/AAAI.V39I4.32429

BibTeX

@inproceedings{jiao2025aaai-spatiotemporal,
  title     = {{SpatioTemporal Learning for Human Pose Estimation in Sparsely-Labeled Videos}},
  author    = {Jiao, Yingying and Wang, Zhigang and Wu, Sifan and Fan, Shaojing and Liu, Zhenguang and Xu, Zhuoyue and Wu, Zheqi},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {4093-4101},
  doi       = {10.1609/AAAI.V39I4.32429},
  url       = {https://mlanthology.org/aaai/2025/jiao2025aaai-spatiotemporal/}
}