Part-Aware Unified Representation of Language and Skeleton for Zero-Shot Action Recognition
Abstract
While remarkable progress has been made on supervised skeleton-based action recognition, the challenge of zero-shot recognition remains relatively unexplored. In this paper, we argue that relying solely on aligning label-level semantics and global skeleton features is insufficient to effectively transfer locally consistent visual knowledge from seen to unseen classes. To address this limitation, we introduce Part-aware Unified Representation between Language and Skeleton (PURLS) to explore visual-semantic alignment at both local and global scales. PURLS introduces a new prompting module and a novel partitioning module to generate aligned textual and visual representations across different levels. The former leverages a pre-trained GPT-3 to infer refined descriptions of the global and local (body-part-based and temporal-interval-based) movements from the original action labels. The latter employs an adaptive sampling strategy to group visual features from all body joint movements that are semantically relevant to a given description. Our approach is evaluated on various skeleton/language backbones and three large-scale datasets, i.e., NTU-RGB+D 60, NTU-RGB+D 120, and a newly curated dataset, Kinetics-skeleton 200. The results showcase the universality and superior performance of PURLS, surpassing prior skeleton-based solutions and standard baselines from other domains. The source codes can be accessed at https://github.com/azzh1/PURLS.
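The sketch below illustrates the core idea described in the abstract: scoring an unseen class by aligning part-level skeleton features with the language embeddings of its GPT-generated global and local descriptions. It is a minimal illustration only, not the authors' implementation (see https://github.com/azzh1/PURLS for the official code); all names, shapes, and the averaging scheme are assumptions.

```python
# Minimal sketch of part-aware visual-semantic alignment for zero-shot
# skeleton action recognition. Hypothetical shapes and names; the real
# PURLS partitioning/prompting modules are more involved.
import torch
import torch.nn.functional as F

def zero_shot_scores(part_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """
    part_feats: (B, P, D) skeleton features for P partitions per sample
                (e.g., 1 global + body-part-based + temporal-interval-based groups).
    text_feats: (C, P, D) language embeddings of the per-partition
                descriptions for C candidate (unseen) classes.
    Returns:    (B, C) similarity scores; the highest-scoring class is
                the zero-shot prediction.
    """
    v = F.normalize(part_feats, dim=-1)       # unit-norm visual features (B, P, D)
    t = F.normalize(text_feats, dim=-1)       # unit-norm text features   (C, P, D)
    # Cosine similarity per partition, then averaged over partitions.
    sim = torch.einsum('bpd,cpd->bcp', v, t)  # (B, C, P)
    return sim.mean(dim=-1)                   # (B, C)

# Usage (hypothetical tensors):
#   scores = zero_shot_scores(skeleton_parts, class_descriptions)
#   pred = scores.argmax(dim=1)
```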
Cite
Text
Zhu et al. "Part-Aware Unified Representation of Language and Skeleton for Zero-Shot Action Recognition." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01775
Markdown
[Zhu et al. "Part-Aware Unified Representation of Language and Skeleton for Zero-Shot Action Recognition." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/zhu2024cvpr-partaware/) doi:10.1109/CVPR52733.2024.01775
BibTeX
@inproceedings{zhu2024cvpr-partaware,
title = {{Part-Aware Unified Representation of Language and Skeleton for Zero-Shot Action Recognition}},
author = {Zhu, Anqi and Ke, Qiuhong and Gong, Mingming and Bailey, James},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {18761--18770},
doi = {10.1109/CVPR52733.2024.01775},
url = {https://mlanthology.org/cvpr/2024/zhu2024cvpr-partaware/}
}