Spatiotemporal Multiplier Networks for Video Action Recognition

Abstract

This paper presents a general ConvNet architecture for video action recognition based on multiplicative interactions of spacetime features. Our model combines the appearance and motion pathways of a two-stream architecture by motion gating and is trained end-to-end. We theoretically motivate multiplicative gating functions for residual networks and empirically study their effect on classification accuracy. To capture long-term dependencies, we inject identity mapping kernels for learning temporal relationships. Our architecture is fully convolutional in spacetime and able to evaluate a video in a single forward pass. Empirical investigation reveals that our model produces state-of-the-art results on two standard action recognition datasets.
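To give a flavor of the gating idea described above, the following is a minimal NumPy sketch of a multiplicatively gated residual unit, not the authors' exact formulation: a matrix product stands in for the convolutional residual transform, and plain feature vectors stand in for spacetime feature maps. The function and variable names (`gated_residual_block`, `x_app`, `x_mot`) are illustrative assumptions.

```python
import numpy as np

def gated_residual_block(x_app, x_mot, weight):
    """Sketch of a multiplicatively gated residual unit.

    Appearance features are modulated elementwise by motion features
    before the residual transform, so the identity shortcut carries
    appearance while the residual path mixes both streams.
    """
    gated = x_app * x_mot      # multiplicative motion gating
    residual = gated @ weight  # stand-in for a conv layer
    return x_app + residual    # identity shortcut

# Toy example with feature vectors instead of spacetime feature maps.
rng = np.random.default_rng(0)
x_app = rng.standard_normal((2, 4))  # "appearance" stream features
x_mot = rng.standard_normal((2, 4))  # "motion" stream features
W = np.eye(4) * 0.1                  # illustrative residual weights
out = gated_residual_block(x_app, x_mot, W)
```

The key point the sketch illustrates is that the cross-stream interaction is multiplicative rather than additive, so motion features gate which appearance responses pass through the residual path.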

Cite

Text

Feichtenhofer et al. "Spatiotemporal Multiplier Networks for Video Action Recognition." Conference on Computer Vision and Pattern Recognition, 2017. doi:10.1109/CVPR.2017.787

Markdown

[Feichtenhofer et al. "Spatiotemporal Multiplier Networks for Video Action Recognition." Conference on Computer Vision and Pattern Recognition, 2017.](https://mlanthology.org/cvpr/2017/feichtenhofer2017cvpr-spatiotemporal/) doi:10.1109/CVPR.2017.787

BibTeX

@inproceedings{feichtenhofer2017cvpr-spatiotemporal,
  title     = {{Spatiotemporal Multiplier Networks for Video Action Recognition}},
  author    = {Feichtenhofer, Christoph and Pinz, Axel and Wildes, Richard P.},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2017},
  doi       = {10.1109/CVPR.2017.787},
  url       = {https://mlanthology.org/cvpr/2017/feichtenhofer2017cvpr-spatiotemporal/}
}