Shrinking Temporal Attention in Transformers for Video Action Recognition

Abstract

Spatiotemporal modeling in a unified architecture is key for video action recognition. This paper proposes the Shrinking Temporal Attention Transformer (STAT), which efficiently builds spatiotemporal attention maps by accounting for the attenuation of spatial attention over short and long temporal ranges. Specifically, the query token interacts with short-term temporal tokens in a fine-grained manner to capture short-range motion, then shrinks to coarse neighborhood attention over long-term tokens, providing a larger receptive field for long-range spatial aggregation. The two are composed in a short-long temporal integrated block that models visual appearance and temporal structure concurrently at lower computational cost. We conduct thorough ablation studies and achieve state-of-the-art results on multiple action recognition benchmarks, including Kinetics-400 and Something-Something V2, outperforming prior methods with 50% fewer FLOPs and without any pretrained model.
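
The abstract describes the mechanism only at a high level. Below is a minimal PyTorch sketch of the short-long attention idea, not the authors' implementation: the class name, the shrink pooling factor, the short_range window, and the average-pooling/concatenation details are all illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ShrinkingTemporalAttentionSketch(nn.Module):
    # Illustrative sketch only: fine-grained attention over short-term
    # frames, coarse (spatially pooled) attention over long-term frames,
    # fused in a single attention call per query frame.
    def __init__(self, dim, num_heads=8, shrink=2):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.shrink = shrink  # hypothetical spatial pooling factor
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, short_range=1):
        # x: (B, T, N, C) -- batch, frames, spatial tokens (H*W), channels
        B, T, N, C = x.shape
        H = W = int(N ** 0.5)  # assumes a square token grid

        q = self.q(x)  # queries for every frame at full resolution

        # Long-term path: pool each frame's token grid to a coarse grid,
        # shrinking N tokens down to N / shrink**2.
        coarse = x.permute(0, 1, 3, 2).reshape(B * T, C, H, W)
        coarse = F.avg_pool2d(coarse, self.shrink)
        coarse = coarse.reshape(B, T, C, -1).permute(0, 1, 3, 2)

        def split_heads(z):  # (B, L, C) -> (B, heads, L, head_dim)
            return z.reshape(B, -1, self.num_heads, self.head_dim).transpose(1, 2)

        outs = []
        for t in range(T):
            lo, hi = max(0, t - short_range), min(T, t + short_range + 1)
            # Short-term neighbors keep full spatial resolution...
            fine = x[:, lo:hi].reshape(B, -1, C)
            # ...while all remaining frames contribute only coarse tokens.
            far = torch.cat([coarse[:, :lo], coarse[:, hi:]], dim=1)
            ctx = torch.cat([fine, far.reshape(B, -1, C)], dim=1)

            k, v = self.kv(ctx).chunk(2, dim=-1)
            attn = (split_heads(q[:, t]) @ split_heads(k).transpose(-2, -1)) * self.scale
            out = attn.softmax(dim=-1) @ split_heads(v)
            outs.append(out.transpose(1, 2).reshape(B, N, C))

        return self.proj(torch.stack(outs, dim=1))  # (B, T, N, C)

Under these assumptions, each query frame attends to roughly (2*short_range + 1)*N fine keys plus (T - 2*short_range - 1)*N/shrink**2 coarse keys instead of T*N full-resolution keys, which is the kind of reduction the paper's FLOPs claim suggests.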

Cite

Text

Li et al. "Shrinking Temporal Attention in Transformers for Video Action Recognition." AAAI Conference on Artificial Intelligence, 2022. doi:10.1609/AAAI.V36I2.20013

Markdown

[Li et al. "Shrinking Temporal Attention in Transformers for Video Action Recognition." AAAI Conference on Artificial Intelligence, 2022.](https://mlanthology.org/aaai/2022/li2022aaai-shrinking/) doi:10.1609/AAAI.V36I2.20013

BibTeX

@inproceedings{li2022aaai-shrinking,
  title     = {{Shrinking Temporal Attention in Transformers for Video Action Recognition}},
  author    = {Li, Bonan and Xiong, Pengfei and Han, Congying and Guo, Tiande},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2022},
  pages     = {1263--1271},
  doi       = {10.1609/AAAI.V36I2.20013},
  url       = {https://mlanthology.org/aaai/2022/li2022aaai-shrinking/}
}