Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning

Abstract

Despite the success of fully-supervised human skeleton sequence modeling, self-supervised pre-training for skeleton sequence representation learning has become an active field because task-specific skeleton annotations are difficult to acquire at large scale. Recent studies focus on learning video-level temporal and discriminative information using contrastive learning, but overlook the hierarchical spatial-temporal nature of human skeletons. Unlike such superficial supervision at the video level, we propose a self-supervised hierarchical pre-training scheme incorporated into a hierarchical Transformer-based skeleton sequence encoder (Hi-TRS), to explicitly capture spatial, short-term, and long-term temporal dependencies at the frame, clip, and video levels, respectively. To evaluate the proposed self-supervised pre-training scheme with Hi-TRS, we conduct extensive experiments on three skeleton-based downstream tasks: action recognition, action detection, and motion prediction. Under both supervised and semi-supervised evaluation protocols, our method achieves state-of-the-art performance. Additionally, we demonstrate that the prior knowledge learned by our model in the pre-training stage transfers well to different downstream tasks.
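
The frame, clip, and video hierarchy described in the abstract can be illustrated with a small data-flow sketch. This is not the Hi-TRS implementation: mean pooling stands in for each Transformer encoder, and the function names, the clip length, and the stride (including whether clips overlap) are assumptions made only for illustration.

```python
from typing import List, Tuple

Vector = List[float]
Frame = List[Vector]  # one frame: a list of joints, each a coordinate vector


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a list of equal-length vectors (stand-in for a Transformer encoder)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def split_into_clips(frame_feats: List[Vector], clip_len: int, stride: int) -> List[List[Vector]]:
    """Slide a window over per-frame features; clip_len and stride are hypothetical."""
    last_start = len(frame_feats) - clip_len
    return [frame_feats[s:s + clip_len] for s in range(0, last_start + 1, stride)]


def encode_video(frames: List[Frame], clip_len: int = 4, stride: int = 2
                 ) -> Tuple[List[Vector], List[Vector], Vector]:
    # Frame level: aggregate joints within each frame (spatial dependencies).
    frame_feats = [mean_pool(frame) for frame in frames]
    # Clip level: aggregate neighbouring frames (short-term temporal dependencies).
    clip_feats = [mean_pool(clip) for clip in split_into_clips(frame_feats, clip_len, stride)]
    # Video level: aggregate all clips (long-term temporal dependencies).
    video_feat = mean_pool(clip_feats)
    return frame_feats, clip_feats, video_feat


# Toy input: 8 frames, 3 joints per frame, 2-D coordinates (time index, joint index).
frames = [[[float(t), float(j)] for j in range(3)] for t in range(8)]
frame_feats, clip_feats, video_feat = encode_video(frames)
print(len(frame_feats), len(clip_feats), len(video_feat))
```

In an actual model each `mean_pool` call would be replaced by a learned encoder with its own self-supervised objective at that level; the sketch only shows how features are grouped as they move up the hierarchy.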

Cite

Text

Chen et al. "Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19809-0_11

Markdown

[Chen et al. "Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/chen2022eccv-hierarchically/) doi:10.1007/978-3-031-19809-0_11

BibTeX

@inproceedings{chen2022eccv-hierarchically,
  title     = {{Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning}},
  author    = {Chen, Yuxiao and Zhao, Long and Yuan, Jianbo and Tian, Yu and Xia, Zhaoyang and Geng, Shijie and Han, Ligong and Metaxas, Dimitris N.},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-19809-0_11},
  url       = {https://mlanthology.org/eccv/2022/chen2022eccv-hierarchically/}
}