Shifted Chunk Transformer for Spatio-Temporal Representational Learning
Abstract
Spatio-temporal representational learning has been widely adopted in various fields such as action recognition, video object segmentation, and action anticipation. Previous spatio-temporal representational learning approaches primarily employ ConvNets or sequential models, e.g., LSTM, to learn intra-frame and inter-frame features. Recently, Transformer models have successfully dominated the study of natural language processing (NLP), image classification, etc. However, pure-Transformer-based spatio-temporal learning can be prohibitively costly in memory and computation when extracting fine-grained features from a tiny patch. To tackle the training difficulty and enhance spatio-temporal learning, we construct a shifted chunk Transformer with pure self-attention blocks. Leveraging recent efficient Transformer designs in NLP, this shifted chunk Transformer can learn hierarchical spatio-temporal features from a local tiny patch to a global video clip. Our shifted self-attention can also effectively model complicated inter-frame variances. Furthermore, we build a clip encoder based on Transformer to model long-term temporal dependencies. We conduct thorough ablation studies to validate each component and hyper-parameter in our shifted chunk Transformer, and it outperforms previous state-of-the-art approaches on Kinetics-400, Kinetics-600, UCF101, and HMDB51.
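The abstract describes restricting self-attention to chunks of tokens and shifting the chunk boundaries so information can flow between them. As a rough, stdlib-only sketch of that general chunked/shifted attention idea (the function names, chunk size, and the cyclic-shift scheme below are illustrative assumptions, not the authors' exact design):

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def chunk_self_attention(x, chunk, shift=0):
    """Self-attention restricted to fixed-size chunks of a (T x D) sequence.

    `shift` cyclically rotates the tokens before chunking (and rotates back
    afterwards), so successive layers see different chunk boundaries and
    tokens near a border can attend across it in the next layer.
    """
    T, D = len(x), len(x[0])
    assert T % chunk == 0
    if shift:
        x = x[shift:] + x[:shift]          # rotate left by `shift`
    out = []
    for s in range(0, T, chunk):
        block = x[s:s + chunk]             # queries, keys, values share the
        for q in block:                    # same features (projections omitted)
            scores = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(D)
                              for k in block])
            out.append([sum(w * v[d] for w, v in zip(scores, block))
                        for d in range(D)])
    if shift:
        out = out[-shift:] + out[:-shift]  # undo the rotation
    return out

# Alternate aligned and shifted layers so features mix across chunk borders.
x = [[float((i * 7 + j * 3) % 5) for j in range(4)] for i in range(8)]
y = chunk_self_attention(x, chunk=4)            # layer 1: aligned chunks
z = chunk_self_attention(y, chunk=4, shift=2)   # layer 2: shifted chunks
```

Each output token is a convex combination of the tokens in its (possibly shifted) chunk, so cost scales with chunk size rather than full sequence length.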
Cite
Text
Zha et al. "Shifted Chunk Transformer for Spatio-Temporal Representational Learning." Neural Information Processing Systems, 2021.
Markdown
[Zha et al. "Shifted Chunk Transformer for Spatio-Temporal Representational Learning." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/zha2021neurips-shifted/)
BibTeX
@inproceedings{zha2021neurips-shifted,
  title = {{Shifted Chunk Transformer for Spatio-Temporal Representational Learning}},
  author = {Zha, Xuefan and Zhu, Wentao and Xun, Lv and Yang, Sen and Liu, Ji},
  booktitle = {Neural Information Processing Systems},
  year = {2021},
  url = {https://mlanthology.org/neurips/2021/zha2021neurips-shifted/}
}