Regularizing Long Short Term Memory with 3D Human-Skeleton Sequences for Action Recognition
Abstract
This paper argues that large-scale action recognition in video can be greatly improved by providing an additional modality in the training data -- namely, 3D human-skeleton sequences -- aimed at complementing poorly represented or missing features of human actions in the training videos. For recognition, we use a Long Short Term Memory (LSTM) network grounded via a deep Convolutional Neural Network (CNN) onto the video. Training of the LSTM is regularized using the output of another encoder LSTM (eLSTM) grounded on 3D human-skeleton training data. For this regularized training of the LSTM, we modify the standard backpropagation through time (BPTT) in order to address the well-known issues with gradient descent in constrained optimization. Our evaluation on three benchmark datasets -- Sports-1M, HMDB-51, and UCF101 -- shows accuracy improvements from 5.3% up to 17.4% relative to the state of the art.
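To make the training setup concrete, below is a minimal sketch of the kind of regularized objective the abstract describes: a CNN-LSTM video classifier whose hidden states are softly tied to features from a skeleton encoder LSTM (eLSTM). This is an illustration under stated assumptions, not the authors' implementation: the PyTorch framing, module sizes (`feat_dim`, `hid_dim`, `joint_dim`), the quadratic relaxation of the constraint, and the penalty weight `lam` are all hypothetical, and the paper's modified BPTT is not reproduced here.

```python
# Illustrative sketch (assumed PyTorch): video LSTM regularized toward
# skeleton-encoder (eLSTM) features. Dimensions and the penalty form are
# hypothetical choices, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoLSTM(nn.Module):
    """Per-frame CNN features -> LSTM -> action logits."""
    def __init__(self, feat_dim=512, hid_dim=256, num_classes=101):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hid_dim, batch_first=True)
        self.cls = nn.Linear(hid_dim, num_classes)

    def forward(self, cnn_feats):            # cnn_feats: (B, T, feat_dim)
        h, _ = self.lstm(cnn_feats)          # h: (B, T, hid_dim)
        return self.cls(h[:, -1]), h         # logits from last step + hidden sequence

class SkeletonEncoder(nn.Module):
    """eLSTM: encodes a 3D-skeleton sequence into per-step features.
    joint_dim=75 assumes 25 joints x 3 coordinates (a common convention)."""
    def __init__(self, joint_dim=75, hid_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(joint_dim, hid_dim, batch_first=True)

    def forward(self, skel):                 # skel: (B, T, joint_dim)
        h, _ = self.lstm(skel)
        return h                             # (B, T, hid_dim)

def regularized_loss(logits, labels, video_h, skel_h, lam=0.1):
    """Classification loss plus a soft penalty pulling the video LSTM's
    hidden states toward the (frozen) skeleton features. The paper's hard
    constraint is relaxed here into a quadratic penalty for illustration."""
    ce = F.cross_entropy(logits, labels)
    reg = F.mse_loss(video_h, skel_h.detach())
    return ce + lam * reg
```

At test time only the video branch would be needed; the skeleton modality serves purely as a training-time regularizer, which matches the abstract's framing of skeletons as an additional training modality.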
Cite
Text
Mahasseni and Todorovic. "Regularizing Long Short Term Memory with 3D Human-Skeleton Sequences for Action Recognition." Conference on Computer Vision and Pattern Recognition, 2016. doi:10.1109/CVPR.2016.333
Markdown
[Mahasseni and Todorovic. "Regularizing Long Short Term Memory with 3D Human-Skeleton Sequences for Action Recognition." Conference on Computer Vision and Pattern Recognition, 2016.](https://mlanthology.org/cvpr/2016/mahasseni2016cvpr-regularizing/) doi:10.1109/CVPR.2016.333
BibTeX
@inproceedings{mahasseni2016cvpr-regularizing,
title = {{Regularizing Long Short Term Memory with 3D Human-Skeleton Sequences for Action Recognition}},
author = {Mahasseni, Behrooz and Todorovic, Sinisa},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2016},
doi = {10.1109/CVPR.2016.333},
url = {https://mlanthology.org/cvpr/2016/mahasseni2016cvpr-regularizing/}
}