Multi-Scale Spatial-Temporal Integration Convolutional Tube for Human Action Recognition

Haoze Wu, Jiawei Liu, Xierong Zhu, Meng Wang, Zheng-Jun Zha

IJCAI 2020 pp. 753-759

doi:10.24963/IJCAI.2020/105

Abstract

Multi-scale representations consistently improve performance across a wide range of image recognition tasks. However, with the added temporal dimension in the video domain, directly obtaining layer-wise multi-scale spatial-temporal features adds substantial computational cost. In this work, we propose a novel and efficient Multi-Scale Spatial-Temporal Integration Convolutional Tube (MSTI) that aims to recognize actions accurately at lower computational cost. It first extracts multi-scale spatial and temporal features through a multi-scale convolution block. To account for the interaction between representations at different scales and between spatial appearance and temporal motion, we employ cross-scale attention weighted blocks that recalibrate features by integrating multi-scale spatial and temporal features. We also present an end-to-end deep network, MSTI-Net, built on the proposed MSTI tube for human action recognition. Extensive experimental results show that MSTI-Net significantly boosts the performance of existing convolutional networks and achieves state-of-the-art accuracy on three challenging benchmarks, UCF-101, HMDB-51 and Kinetics-400, with far fewer parameters and FLOPs.
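
The cost concern raised in the abstract can be illustrated with simple weight-count arithmetic. The sketch below is not the MSTI design: the channel width, the kernel sizes, and the helper `conv3d_params` are all hypothetical choices made for illustration. It only compares the number of weights in dense k x k x k 3D convolutions at several scales against separate spatial (1 x k x k) and temporal (k x 1 x 1) convolutions at the same scales.

```python
def conv3d_params(c_in, c_out, kt, kh, kw, bias=False):
    """Weight count of a dense 3D convolution with kernel (kt, kh, kw)."""
    return c_in * c_out * kt * kh * kw + (c_out if bias else 0)


# Hypothetical settings, chosen only for illustration (not taken from the paper).
channels = 64
scales = (3, 5, 7)

# Joint multi-scale: one dense k x k x k convolution per scale.
joint = sum(conv3d_params(channels, channels, k, k, k) for k in scales)

# Separate multi-scale: a spatial 1 x k x k and a temporal k x 1 x 1
# convolution per scale.
separate = sum(
    conv3d_params(channels, channels, 1, k, k)
    + conv3d_params(channels, channels, k, 1, 1)
    for k in scales
)

print(f"joint 3D kernels:          {joint:,} weights")
print(f"separate spatial/temporal: {separate:,} weights")
print(f"ratio: {joint / separate:.2f}x")
```

Under these assumed settings the joint kernels need 2,027,520 weights against 401,408 for the separate ones, roughly a fivefold difference. FLOPs scale with the same kernel-volume factor at a fixed output size, which is one reason naive layer-wise multi-scale 3D convolution becomes expensive.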

Cite

Text

Wu et al. "Multi-Scale Spatial-Temporal Integration Convolutional Tube for Human Action Recognition." International Joint Conference on Artificial Intelligence, 2020. doi:10.24963/IJCAI.2020/105

Markdown

[Wu et al. "Multi-Scale Spatial-Temporal Integration Convolutional Tube for Human Action Recognition." International Joint Conference on Artificial Intelligence, 2020.](https://mlanthology.org/ijcai/2020/wu2020ijcai-multi/) doi:10.24963/IJCAI.2020/105

BibTeX

@inproceedings{wu2020ijcai-multi,
  title     = {{Multi-Scale Spatial-Temporal Integration Convolutional Tube for Human Action Recognition}},
  author    = {Wu, Haoze and Liu, Jiawei and Zhu, Xierong and Wang, Meng and Zha, Zheng-Jun},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2020},
  pages     = {753-759},
  doi       = {10.24963/IJCAI.2020/105},
  url       = {https://mlanthology.org/ijcai/2020/wu2020ijcai-multi/}
}