Residual Stacked RNNs for Action Recognition

Abstract

Action recognition pipelines that use Recurrent Neural Networks (RNNs) are currently 5–10% less accurate than Convolutional Neural Networks (CNNs). While most works that use RNNs employ a 2D CNN on each frame to extract descriptors for action recognition, we extract spatiotemporal features from a 3D CNN and then learn the temporal relationship of these descriptors through a stacked residual recurrent neural network (Res-RNN). We introduce for the first time residual learning to counter the degradation problem in multi-layer RNNs, which have been successful for temporal aggregation in two-stream action recognition pipelines. Finally, we use a late fusion strategy to combine RGB and optical flow data of the two-stream Res-RNN. Experimental results show that the proposed pipeline achieves competitive results on UCF-101 and state-of-the-art results for RNN-like architectures on the challenging HMDB-51 dataset.
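The core idea of the Res-RNN — adding each stacked RNN layer's input to its output so that stacking more recurrent layers does not degrade accuracy — can be sketched as follows. This is a minimal NumPy illustration with made-up dimensions and a vanilla tanh cell, not the authors' implementation; it assumes the input feature size equals the hidden size so the skip connection is the identity:

```python
import numpy as np

def rnn_step(x, h, Wx, Wh, b):
    """One vanilla tanh RNN step."""
    return np.tanh(x @ Wx + h @ Wh + b)

def residual_stacked_rnn(seq, num_layers=2, hidden=8, seed=0):
    """Run a (T, hidden) sequence through stacked RNN layers with
    residual connections between layers: out_l = RNN_l(in_l) + in_l."""
    rng = np.random.default_rng(seed)
    params = [
        (rng.standard_normal((hidden, hidden)) * 0.1,  # input weights Wx
         rng.standard_normal((hidden, hidden)) * 0.1,  # recurrent weights Wh
         np.zeros(hidden))                              # bias b
        for _ in range(num_layers)
    ]
    layer_in = seq
    for Wx, Wh, b in params:
        h = np.zeros(hidden)
        outs = []
        for x in layer_in:                 # unroll over time
            h = rnn_step(x, h, Wx, Wh, b)
            outs.append(h)
        layer_in = np.stack(outs) + layer_in  # residual (identity) skip
    return layer_in

# e.g. 5 timesteps of (hypothetical) 3D-CNN features for one stream
feats = np.random.default_rng(1).standard_normal((5, 8))
out = residual_stacked_rnn(feats)
print(out.shape)  # (5, 8)
```

In the paper's two-stream setting, one such stack would process RGB features and another optical-flow features, with their predictions combined by late fusion.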

Cite

Text

Lakhal et al. "Residual Stacked RNNs for Action Recognition." European Conference on Computer Vision Workshops, 2018. doi:10.1007/978-3-030-11012-3_40

Markdown

[Lakhal et al. "Residual Stacked RNNs for Action Recognition." European Conference on Computer Vision Workshops, 2018.](https://mlanthology.org/eccvw/2018/lakhal2018eccvw-residual/) doi:10.1007/978-3-030-11012-3_40

BibTeX

@inproceedings{lakhal2018eccvw-residual,
  title     = {{Residual Stacked RNNs for Action Recognition}},
  author    = {Lakhal, Mohamed Ilyes and Clapés, Albert and Escalera, Sergio and Lanz, Oswald and Cavallaro, Andrea},
  booktitle = {European Conference on Computer Vision Workshops},
  year      = {2018},
  pages     = {534--548},
  doi       = {10.1007/978-3-030-11012-3_40},
  url       = {https://mlanthology.org/eccvw/2018/lakhal2018eccvw-residual/}
}