Where to Focus on for Human Action Recognition?
Abstract
In this paper, we present a new attention model for recognizing human actions from RGB-D videos. We propose an attention mechanism based on 3D articulated pose. The objective is to focus on the body parts most relevant to the action. For action classification, we propose a network composed of spatio-temporal subnetworks that model the appearance of human body parts and an RNN attention subnetwork that implements our attention mechanism. Furthermore, we train the proposed network end-to-end using a regularized cross-entropy loss, which jointly trains the RNN to deliver attention globally over the whole set of spatio-temporal features extracted by 3D ConvNets. Our method outperforms state-of-the-art methods on the largest human activity recognition dataset available to date (NTU RGB+D Dataset), which is also multi-view, and on a human action recognition dataset with object interaction (Northwestern-UCLA Multiview Action 3D Dataset).
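To make the idea of attention over body-part features concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not specify how attention scores are normalized or what the regularizer is, so the softmax normalization, the squared-norm penalty, and every name (`attend`, `regularized_cross_entropy`, `lam`) are assumptions made for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(part_features, part_scores):
    """Weight each body-part feature vector by its attention and sum them.

    part_features: one feature vector per body part (hypothetically, the
                   output of that part's spatio-temporal 3D ConvNet).
    part_scores:   one unnormalized score per body part (hypothetically,
                   produced by an RNN reading the 3D pose sequence).
    Softmax normalization is an assumption, not taken from the abstract.
    """
    weights = softmax(part_scores)
    dim = len(part_features[0])
    fused = [sum(w * f[d] for w, f in zip(weights, part_features))
             for d in range(dim)]
    return fused, weights

def regularized_cross_entropy(class_probs, true_class, weights, lam=0.01):
    """Cross-entropy plus a placeholder penalty on the attention weights.

    The squared-norm penalty is a stand-in; the abstract only says the
    loss is "regularized" and does not give the form of the regularizer.
    """
    ce = -math.log(class_probs[true_class])
    penalty = lam * sum(w * w for w in weights)
    return ce + penalty

# Three hypothetical body parts with 2-D features and equal scores.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = attend(features, [0.0, 0.0, 0.0])
loss = regularized_cross_entropy([0.5, 0.5], 0, weights, lam=0.0)
```

With equal scores, every part receives the same weight and the fused vector is the plain average of the part features. Raising one part's score shifts the fused representation toward that part, which is the sense in which attention "focuses" on a body part.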
Cite
Text
Das et al. "Where to Focus on for Human Action Recognition?." IEEE/CVF Winter Conference on Applications of Computer Vision, 2019. doi:10.1109/WACV.2019.00015
Markdown
[Das et al. "Where to Focus on for Human Action Recognition?." IEEE/CVF Winter Conference on Applications of Computer Vision, 2019.](https://mlanthology.org/wacv/2019/das2019wacv-focus/) doi:10.1109/WACV.2019.00015
BibTeX
@inproceedings{das2019wacv-focus,
title = {{Where to Focus on for Human Action Recognition?}},
author = {Das, Srijan and Chaudhary, Arpit and Brémond, François and Thonnat, Monique},
booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision},
year = {2019},
pages = {71-80},
doi = {10.1109/WACV.2019.00015},
url = {https://mlanthology.org/wacv/2019/das2019wacv-focus/}
}