Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation

Abstract

Skeleton-based Temporal Action Segmentation (STAS) aims to densely segment and classify human actions in long, untrimmed skeletal motion sequences. Existing STAS methods primarily model spatial dependencies among joints and temporal relationships among frames to generate frame-level one-hot classifications. However, these methods do not exploit the semantic relations among joints and among actions at a linguistic level, which limits how comprehensively they understand skeleton actions. In this work, we propose a Language-assisted Skeleton Action Understanding (LaSA) method that leverages the language modality to assist in learning semantic relationships among joints and actions. Specifically, for joint relationships, the Joint Relationships Establishment (JRE) module establishes correlations among joints in the feature sequence by applying attention between joint texts, and differentiates distinct joints by embedding joint texts as positional embeddings. For action relationships, the Action Relationships Supervision (ARS) module enhances discrimination across action classes through contrastive learning of single-class action-text pairs, and models the semantic associations of adjacent actions by contrasting mixed-class clip-text pairs. Evaluation on five public datasets demonstrates that LaSA achieves state-of-the-art results. Code is available at https://github.com/HaoyuJi/LaSA.
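
To make the contrastive idea behind the action-text pairs concrete, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss between pooled action features and text embeddings. It is not taken from the LaSA code: the function names, the temperature value of 0.07, and the assumption that row i of the action features is paired with row i of the text features (one distinct class per row) are all illustrative choices.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(action_feats, text_feats, temperature=0.07):
    """Hypothetical InfoNCE-style loss; row i of each input forms a positive pair.

    action_feats: (N, D) pooled features of N action segments (assumed input).
    text_feats:   (N, D) embeddings of the N matching action descriptions.
    """
    a = l2_normalize(action_feats)
    t = l2_normalize(text_feats)
    logits = a @ t.T / temperature           # (N, N) scaled cosine similarities
    idx = np.arange(len(a))                  # positives lie on the diagonal
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # action -> text
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> action
    return 0.5 * (loss_a2t + loss_t2a)


# Toy usage: four orthogonal "text" embeddings and perfectly aligned actions.
text = np.eye(4, 8)
loss_matched = symmetric_contrastive_loss(text.copy(), text)
loss_mismatched = symmetric_contrastive_loss(np.roll(text, 1, axis=0), text)
```

In this toy setup the aligned pairs give a loss near zero, while shifting the pairing by one row gives a large loss, which is the behaviour a contrastive objective is meant to reward. Batches in which several segments share an action class would need a multi-positive variant, which this sketch does not cover.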

Cite

Text

Ji et al. "Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72949-2_23

Markdown

[Ji et al. "Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/ji2024eccv-languageassisted/) doi:10.1007/978-3-031-72949-2_23

BibTeX

@inproceedings{ji2024eccv-languageassisted,
  title     = {{Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation}},
  author    = {Ji, Haoyu and Chen, Bowen and Xu, Xinglong and Ren, Weihong and Wang, Zhiyong and Liu, Honghai},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72949-2_23},
  url       = {https://mlanthology.org/eccv/2024/ji2024eccv-languageassisted/}
}