Spatio-Temporal Vector of Locally Max Pooled Features for Action Recognition in Videos
Abstract
We introduce the Spatio-Temporal Vector of Locally Max Pooled Features (ST-VLMPF), a super vector-based encoding method designed specifically for encoding local deep features. The method addresses an important problem in video understanding: how to build a video representation that incorporates CNN features over the entire video. Feature assignment is carried out at two levels, using feature similarity and spatio-temporal information. For each assignment we build a specific encoding, tailored to the nature of deep features, that captures the highest feature responses from the strongest neuron activations of the network. ST-VLMPF provides a more reliable video representation than some of the most widely used and powerful encoding approaches (Improved Fisher Vectors and Vector of Locally Aggregated Descriptors), while maintaining low computational complexity. We conduct experiments on three action recognition datasets: HMDB51, UCF50, and UCF101. Our pipeline obtains state-of-the-art results.
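As a rough illustration of the two-level scheme described above, the NumPy sketch below hard-assigns each local CNN feature once by feature similarity and once by its normalized (x, y, t) position, max-pools the features that fall on each codeword, and concatenates the results into a super vector. The helper names, the sign-preserving max pooling, and the codebook sizes are assumptions made for illustration, not the authors' reference implementation, which adds further details (e.g., how the codebooks are learned and how features are normalized).

```python
import numpy as np

def signed_max_pool(block):
    """Element-wise max over absolute values, keeping the original sign
    (one reading of 'capturing the highest feature responses')."""
    idx = np.abs(block).argmax(axis=0)
    return block[idx, np.arange(block.shape[1])]

def hard_assign(points, centroids):
    """Nearest-centroid hard assignment (squared Euclidean distance)."""
    d2 = ((points ** 2).sum(1)[:, None]
          + (centroids ** 2).sum(1)[None, :]
          - 2.0 * points @ centroids.T)
    return d2.argmin(axis=1)

def max_pool_encode(features, assign, n_words):
    """Max-pool the features assigned to each codeword; concatenate."""
    enc = np.zeros((n_words, features.shape[1]))
    for c in range(n_words):
        block = features[assign == c]
        if len(block):
            enc[c] = signed_max_pool(block)
    return enc.ravel()

def st_vlmpf(features, positions, feat_codebook, pos_codebook):
    """Two-level encoding: a similarity assignment over the feature
    codebook plus a spatio-temporal assignment over normalized
    (x, y, t) positions. Both branches pool the deep features."""
    sim_enc = max_pool_encode(features,
                              hard_assign(features, feat_codebook),
                              len(feat_codebook))
    st_enc = max_pool_encode(features,
                             hard_assign(positions, pos_codebook),
                             len(pos_codebook))
    return np.concatenate([sim_enc, st_enc])

# Usage (illustrative sizes): 2000 local CNN features of dimension 512,
# (x, y, t) positions in [0, 1], codebooks of 64 and 32 words.
rng = np.random.default_rng(0)
feats = rng.standard_normal((2000, 512))
pos = rng.random((2000, 3))
video_repr = st_vlmpf(feats, pos,
                      rng.standard_normal((64, 512)), rng.random((32, 3)))
print(video_repr.shape)  # (64 * 512 + 32 * 512,) = (49152,)
```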
Cite
Text
Duta et al. "Spatio-Temporal Vector of Locally Max Pooled Features for Action Recognition in Videos." Conference on Computer Vision and Pattern Recognition, 2017. doi:10.1109/CVPR.2017.341
Markdown
[Duta et al. "Spatio-Temporal Vector of Locally Max Pooled Features for Action Recognition in Videos." Conference on Computer Vision and Pattern Recognition, 2017.](https://mlanthology.org/cvpr/2017/duta2017cvpr-spatiotemporal/) doi:10.1109/CVPR.2017.341
BibTeX
@inproceedings{duta2017cvpr-spatiotemporal,
title = {{Spatio-Temporal Vector of Locally Max Pooled Features for Action Recognition in Videos}},
author = {Duta, Ionut Cosmin and Ionescu, Bogdan and Aizawa, Kiyoharu and Sebe, Nicu},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2017},
doi = {10.1109/CVPR.2017.341},
url = {https://mlanthology.org/cvpr/2017/duta2017cvpr-spatiotemporal/}
}