Regularizing Long Short Term Memory with 3D Human-Skeleton Sequences for Action Recognition

Abstract

This paper argues that large-scale action recognition in video can be greatly improved by providing an additional modality in the training data -- namely, 3D human-skeleton sequences -- aimed at complementing poorly represented or missing features of human actions in the training videos. For recognition, we use a Long Short-Term Memory (LSTM) network grounded via a deep Convolutional Neural Network (CNN) onto the video. Training of the LSTM is regularized using the output of another encoder LSTM (eLSTM) grounded on 3D human-skeleton training data. For this regularized training of the LSTM, we modify the standard backpropagation through time (BPTT) so as to address the well-known issues that gradient descent has with constrained optimization. Our evaluation on three benchmark datasets -- Sports-1M, HMDB-51, and UCF101 -- shows accuracy improvements ranging from 5.3% to 17.4% relative to the state of the art.
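To make the training scheme in the abstract concrete, below is a minimal PyTorch sketch of one plausible reading of the objective: a video LSTM over per-frame CNN features whose hidden sequence is pulled toward the encoding produced by an eLSTM over 3D skeletons. The module names (`VideoLSTM`, `SkeletonEncoderLSTM`), the feature and joint dimensions, and the soft Euclidean matching term are illustrative assumptions, not the authors' released code; the paper enforces its constraint through a modified BPTT rather than a simple additive penalty.

```python
# Hypothetical sketch of the regularized training objective; dimensions
# and the soft penalty are assumptions, not the paper's exact method.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoLSTM(nn.Module):
    """LSTM over per-frame CNN features, with a linear classifier."""
    def __init__(self, feat_dim=2048, hidden_dim=512, num_classes=487):
        super().__init__()  # 487 classes, e.g. Sports-1M
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, cnn_feats):               # (B, T, feat_dim)
        outputs, _ = self.lstm(cnn_feats)       # (B, T, hidden_dim)
        return self.classifier(outputs[:, -1]), outputs

class SkeletonEncoderLSTM(nn.Module):
    """Encoder LSTM (eLSTM) over 3D joint coordinates."""
    def __init__(self, joint_dim=75, hidden_dim=512):
        super().__init__()  # 75 = 25 joints x 3 coords (assumed layout)
        self.lstm = nn.LSTM(joint_dim, hidden_dim, batch_first=True)

    def forward(self, skeletons):                # (B, T, joint_dim)
        outputs, _ = self.lstm(skeletons)
        return outputs                           # (B, T, hidden_dim)

def regularized_loss(video_lstm, elstm, cnn_feats, skeletons, labels, lam=0.5):
    """Classification loss plus a skeleton-matching regularizer."""
    logits, video_h = video_lstm(cnn_feats)
    with torch.no_grad():                        # eLSTM pretrained and frozen
        skel_h = elstm(skeletons)
    cls_loss = F.cross_entropy(logits, labels)
    # Soft Euclidean penalty standing in for the paper's constraint,
    # which is instead enforced via a modified BPTT.
    reg_loss = F.mse_loss(video_h, skel_h)
    return cls_loss + lam * reg_loss
```

At test time only the video branch is needed: the eLSTM and the skeleton data serve purely as a training-time regularizer, so recognition still runs on RGB video alone.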

Cite

Text

Mahasseni and Todorovic. "Regularizing Long Short Term Memory with 3D Human-Skeleton Sequences for Action Recognition." Conference on Computer Vision and Pattern Recognition, 2016. doi:10.1109/CVPR.2016.333

Markdown

[Mahasseni and Todorovic. "Regularizing Long Short Term Memory with 3D Human-Skeleton Sequences for Action Recognition." Conference on Computer Vision and Pattern Recognition, 2016.](https://mlanthology.org/cvpr/2016/mahasseni2016cvpr-regularizing/) doi:10.1109/CVPR.2016.333

BibTeX

@inproceedings{mahasseni2016cvpr-regularizing,
  title     = {{Regularizing Long Short Term Memory with 3D Human-Skeleton Sequences for Action Recognition}},
  author    = {Mahasseni, Behrooz and Todorovic, Sinisa},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2016},
  doi       = {10.1109/CVPR.2016.333},
  url       = {https://mlanthology.org/cvpr/2016/mahasseni2016cvpr-regularizing/}
}