Multi-Scale Motion-Aware Module for Video Action Recognition

Abstract

Due to the lengthy computation time of optical flow, recent works have proposed the correlation operation as an alternative approach to extracting motion features. Although the correlation operation yields significant improvement with negligible FLOPs, it incurs much higher latency per FLOP than convolution and adds noticeable latency as a larger search patch is applied. Nonetheless, shrinking the search patch in the correlation operation is bound to degrade performance, owing to its inability to capture larger displacements. In this paper, we propose an effective, low-latency Multi-Scale Motion-Aware (MSMA) module. It uses smaller search patches at different scales to efficiently extract motion features from large displacements. It can be installed into different CNN backbones and generalizes well across them. When installed into TSM ResNet-50, the MSMA module introduces ≈17.6% more latency on an NVIDIA Tesla V100 GPU, yet it achieves state-of-the-art performance on Something-Something V1 & V2 and Diving-48.
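The abstract's core idea — trading one large correlation search window for small patches applied at several feature scales — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the function names, patch size, and scale factors below are illustrative assumptions:

```python
import numpy as np

def local_correlation(f1, f2, patch=3):
    """Correlate each position of f1 with a (patch x patch) neighborhood of f2.
    f1, f2: (C, H, W) feature maps of two adjacent frames.
    Returns (patch*patch, H, W) motion features (one channel per displacement)."""
    C, H, W = f1.shape
    r = patch // 2
    f2p = np.pad(f2, ((0, 0), (r, r), (r, r)))  # zero-pad spatial borders
    out = np.empty((patch * patch, H, W), dtype=f1.dtype)
    k = 0
    for dy in range(patch):          # enumerate displacements in the patch
        for dx in range(patch):
            shifted = f2p[:, dy:dy + H, dx:dx + W]
            out[k] = (f1 * shifted).sum(axis=0) / C  # channel-wise dot product
            k += 1
    return out

def downsample(f, s):
    """Average-pool a (C, H, W) map by factor s (assumes H, W divisible by s)."""
    C, H, W = f.shape
    return f.reshape(C, H // s, s, W // s, s).mean(axis=(2, 4))

def multi_scale_correlation(f1, f2, patch=3, scales=(1, 2, 4)):
    """A small search patch applied at stride-s scale covers displacements up to
    s * (patch // 2) pixels at the original resolution, so coarse scales see
    large motions while the per-scale correlation cost stays small."""
    return [local_correlation(downsample(f1, s), downsample(f2, s), patch)
            for s in scales]
```

With a 3×3 patch and scales (1, 2, 4), the coarsest level reaches displacements of up to 4 pixels while each correlation only ever compares 9 offsets — the latency argument the abstract makes against a single large search patch.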

Cite

Text

Peng and Tseng. "Multi-Scale Motion-Aware Module for Video Action Recognition." European Conference on Computer Vision Workshops, 2022. doi:10.1007/978-3-031-25075-0_40

Markdown

[Peng and Tseng. "Multi-Scale Motion-Aware Module for Video Action Recognition." European Conference on Computer Vision Workshops, 2022.](https://mlanthology.org/eccvw/2022/peng2022eccvw-multiscale/) doi:10.1007/978-3-031-25075-0_40

BibTeX

@inproceedings{peng2022eccvw-multiscale,
  title     = {{Multi-Scale Motion-Aware Module for Video Action Recognition}},
  author    = {Peng, Huai-Wei and Tseng, Yu-Chee},
  booktitle = {European Conference on Computer Vision Workshops},
  year      = {2022},
  pages     = {589--606},
  doi       = {10.1007/978-3-031-25075-0_40},
  url       = {https://mlanthology.org/eccvw/2022/peng2022eccvw-multiscale/}
}