VLAD3: Encoding Dynamics of Deep Features for Action Recognition
Abstract
Previous approaches to action recognition with deep features tend to process video frames only within a small temporal region, and do not model long-range dynamic information explicitly. However, such information is important for the accurate recognition of actions, especially for the discrimination of complex activities that share sub-actions, and when dealing with untrimmed videos. Here, we propose a representation, VLAD for Deep Dynamics (VLAD^3), that accounts for different levels of video dynamics. It captures short-term dynamics with deep convolutional neural network features, relying on linear dynamic systems (LDS) to model medium-range dynamics. To account for long-range inhomogeneous dynamics, a VLAD descriptor is derived for the LDS and pooled over the whole video, to arrive at the final VLAD^3 representation. An extensive evaluation was performed on Olympic Sports, UCF101 and THUMOS15, where the use of the VLAD^3 representation leads to state-of-the-art results.
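For orientation, here is a minimal sketch (Python/NumPy) of the pipeline the abstract outlines: fit an LDS to the CNN features of each medium-range clip, turn each LDS into a descriptor, and VLAD-pool those descriptors over the whole video. Everything below is an illustrative assumption rather than the authors' implementation: the clip descriptor naively stacks the LDS parameters (A, C), whereas the paper derives a proper VLAD descriptor for LDSs, and the names fit_lds, vlad_encode, n_states and the externally learned k-means codebook centers are all hypothetical.

import numpy as np

def fit_lds(F, n_states=10):
    # F: (T, D) deep features for one clip, with T >= n_states.
    # Standard SVD-based subspace identification (as used for dynamic
    # textures): observation matrix C from the top singular vectors,
    # transition matrix A by least squares on the latent states.
    U, s, Vt = np.linalg.svd(F.T, full_matrices=False)
    C = U[:, :n_states]                        # (D, n) observation matrix
    X = np.diag(s[:n_states]) @ Vt[:n_states]  # (n, T) latent state sequence
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])   # X[:, t+1] ~ A @ X[:, t]
    return np.concatenate([A.ravel(), C.ravel()])  # naive (A, C) descriptor

def vlad_encode(descriptors, centers):
    # descriptors: (N, d) LDS descriptors from all clips of one video.
    # centers: (K, d) codebook, e.g. learned offline with k-means.
    K, d = centers.shape
    vlad = np.zeros((K, d))
    assign = ((descriptors[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
    for x, k in zip(descriptors, assign):
        vlad[k] += x - centers[k]              # accumulate residuals
    v = vlad.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))        # power normalization
    return v / (np.linalg.norm(v) + 1e-12)     # L2 normalization

# Usage: video_rep = vlad_encode(np.stack([fit_lds(c) for c in clips]), centers)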
Cite
Text
Li et al. "VLAD3: Encoding Dynamics of Deep Features for Action Recognition." Conference on Computer Vision and Pattern Recognition, 2016. doi:10.1109/CVPR.2016.215
Markdown
[Li et al. "VLAD3: Encoding Dynamics of Deep Features for Action Recognition." Conference on Computer Vision and Pattern Recognition, 2016.](https://mlanthology.org/cvpr/2016/li2016cvpr-vlad3/) doi:10.1109/CVPR.2016.215
BibTeX
@inproceedings{li2016cvpr-vlad3,
title = {{VLAD3: Encoding Dynamics of Deep Features for Action Recognition}},
author = {Li, Yingwei and Li, Weixin and Mahadevan, Vijay and Vasconcelos, Nuno},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2016},
doi = {10.1109/CVPR.2016.215},
url = {https://mlanthology.org/cvpr/2016/li2016cvpr-vlad3/}
}