Rethinking Masked Data Reconstruction Pretraining for Strong 3D Action Representation Learning

Abstract

In 3D human action recognition, limited supervised data makes it challenging to fully exploit the modeling capacity of powerful networks such as transformers. As a result, researchers have been actively investigating effective self-supervised pre-training strategies. For example, MAMP shows that explicit masked motion reconstruction, rather than the prevalent masked joint reconstruction, is key to learning effective feature representations for 3D action recognition. However, we find that with a simple yet effective change to the reconstruction target, masked joint reconstruction can match the results of masked motion reconstruction. The devil is in the special characteristics of 3D skeleton data and the normalization of the training targets: all of the effective information in the targets must be preserved during normalization. Moreover, since masked data reconstruction mainly learns local relations within each input sample to fulfill the reconstruction task, rather than modeling relations among samples, we further employ contrastive learning to learn more discriminative 3D action representations. We show that contrastive learning consistently boosts the performance of models pre-trained with masked joint prediction under various settings, especially in the semi-supervised setting with a very limited number of labeled samples. Extensive experiments on the NTU-60, NTU-120, and PKU-MMD datasets show that the proposed pre-training strategy achieves state-of-the-art results without bells and whistles.
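To make the described pre-training objective concrete, below is a minimal PyTorch-style sketch of masked joint reconstruction on normalized targets combined with an InfoNCE-style contrastive term. The module names (`encoder`, `decoder`, `proj_head`), the per-sequence normalization scheme, the masking-as-augmentation trick, and the loss weight `lambda_con` are all illustrative assumptions for this sketch, not the authors' released implementation.

```python
# Sketch: masked joint reconstruction with normalized targets plus a
# contrastive objective. Module names and normalization are assumptions.
import torch
import torch.nn.functional as F


def normalize_targets(joints, eps=1e-6):
    """Per-sequence normalization of raw joint coordinates (assumed scheme).

    joints: (B, T, J, 3) skeleton sequences.
    Returns zero-mean, unit-variance targets so the reconstruction loss is
    not dominated by the global scale or offset of each skeleton.
    """
    mean = joints.mean(dim=(1, 2), keepdim=True)
    std = joints.std(dim=(1, 2), keepdim=True)
    return (joints - mean) / (std + eps)


def pretraining_losses(encoder, decoder, proj_head, joints, mask,
                       temperature=0.1, lambda_con=0.5):
    """Combine masked joint reconstruction with a contrastive term.

    encoder / decoder / proj_head are hypothetical modules; mask is a
    boolean tensor (B, T, J) marking joints hidden from the encoder.
    """
    targets = normalize_targets(joints)                     # (B, T, J, 3)

    # Masked reconstruction: predict normalized coords of masked joints.
    latent = encoder(joints, mask)                          # (B, N, D)
    pred = decoder(latent)                                  # (B, T, J, 3)
    rec_loss = F.mse_loss(pred[mask], targets[mask])

    # Contrastive term: two "views" per sample; two independent random
    # masks stand in for augmentation in this sketch.
    mask2 = torch.rand_like(mask, dtype=torch.float) < mask.float().mean()
    z1 = proj_head(encoder(joints, mask).mean(dim=1))       # (B, D)
    z2 = proj_head(encoder(joints, mask2).mean(dim=1))      # (B, D)
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                      # (B, B)
    labels = torch.arange(z1.size(0), device=z1.device)
    con_loss = F.cross_entropy(logits, labels)

    return rec_loss + lambda_con * con_loss
```

In this sketch the reconstruction loss is computed only on masked joints against the normalized targets, and the contrastive loss treats matching views of the same sequence as positives within a batch; how the paper actually normalizes targets and forms views should be taken from the publication itself.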

Cite

Text

Gong et al. "Rethinking Masked Data Reconstruction Pretraining for Strong 3D Action Representation Learning." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I3.32324

Markdown

[Gong et al. "Rethinking Masked Data Reconstruction Pretraining for Strong 3D Action Representation Learning." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/gong2025aaai-rethinking/) doi:10.1609/AAAI.V39I3.32324

BibTeX

@inproceedings{gong2025aaai-rethinking,
  title     = {{Rethinking Masked Data Reconstruction Pretraining for Strong 3D Action Representation Learning}},
  author    = {Gong, Tao and Chu, Qi and Liu, Bin and Yu, Nenghai},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {3149--3157},
  doi       = {10.1609/AAAI.V39I3.32324},
  url       = {https://mlanthology.org/aaai/2025/gong2025aaai-rethinking/}
}