Embedding Sequential Information into Spatiotemporal Features for Action Recognition
Abstract
In this paper, we introduce a novel framework for video-based action recognition that combines sequential information with spatiotemporal features. Specifically, spatiotemporal features are extracted from sliced clips of a video, and a recurrent neural network is then applied to embed the sequential information into the final feature representation of the video. In contrast to most current deep learning methods for video-based tasks, our framework incorporates both the long-term dependencies and the spatiotemporal information of the clips in the video. To extract spatiotemporal features from the clips, both dense trajectories (DT) and a recently proposed 3D convolutional neural network, C3D, are applied in our experiments. Our proposed framework is evaluated on the benchmark datasets UCF101 and HMDB51, and achieves performance comparable to state-of-the-art results.
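The pipeline the abstract describes — per-clip spatiotemporal features aggregated by a recurrent network into one video-level representation — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions, the vanilla-RNN update, and the random stand-in features (in place of real C3D or DT descriptors) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration only.
num_clips, feat_dim, hidden_dim = 8, 16, 12

# Stand-ins for per-clip spatiotemporal features (e.g., C3D or DT
# descriptors in the paper); one row per sliced clip, in temporal order.
clip_features = rng.standard_normal((num_clips, feat_dim))

# Randomly initialized vanilla-RNN parameters (hypothetical).
W_xh = rng.standard_normal((feat_dim, hidden_dim)) * 0.1
W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
b_h = np.zeros(hidden_dim)

# Recurrently fold the clip sequence into a single hidden state, so the
# final state carries the sequential (long-term) information.
h = np.zeros(hidden_dim)
for x in clip_features:
    h = np.tanh(x @ W_xh + h @ W_hh + b_h)

video_representation = h  # video-level feature fed to a classifier
print(video_representation.shape)  # (12,)
```

In practice the final representation would be passed to a classifier over the action categories; the abstract leaves those details to the full paper.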
Cite
Text
Ye and Tian. "Embedding Sequential Information into Spatiotemporal Features for Action Recognition." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2016. doi:10.1109/CVPRW.2016.142
Markdown
[Ye and Tian. "Embedding Sequential Information into Spatiotemporal Features for Action Recognition." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2016.](https://mlanthology.org/cvprw/2016/ye2016cvprw-embedding/) doi:10.1109/CVPRW.2016.142
BibTeX
@inproceedings{ye2016cvprw-embedding,
title = {{Embedding Sequential Information into Spatiotemporal Features for Action Recognition}},
author = {Ye, Yuancheng and Tian, Yingli},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2016},
pages = {1110-1118},
doi = {10.1109/CVPRW.2016.142},
url = {https://mlanthology.org/cvprw/2016/ye2016cvprw-embedding/}
}