Video Action Detection with Relational Dynamic-Poselets
Abstract
Action detection is of great importance in understanding human motion from video. Compared with action recognition, it not only recognizes the action type but also localizes its spatiotemporal extent. This paper presents a relational model for action detection, which first decomposes a human action into temporal “key poses” and then further into spatial “action parts”. Specifically, we start by clustering cuboids around each human joint into dynamic-poselets using a new descriptor. The cuboids from the same cluster share consistent geometric and dynamic structure, and each cluster acts as a mixture of body parts. We then propose a sequential skeleton model to capture the relations among dynamic-poselets. This model unifies, in a single framework, the tasks of learning the composites of mixture dynamic-poselets, the spatiotemporal structures of action parts, and the local model for each action part. Our model not only localizes the action in a video stream but also enables detailed pose estimation of the actor. We formulate the model learning problem in a structured SVM framework and speed up model inference by dynamic programming. We conduct experiments on three challenging action detection datasets: the MSR-II dataset, the UCF Sports dataset, and the JHMDB dataset. The results show that our method outperforms the state-of-the-art methods on these datasets.
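The abstract mentions that inference is accelerated by dynamic programming. As a rough illustration only, the sketch below runs generic max-sum (Viterbi-style) dynamic programming over a simple chain of key poses, where each key pose picks one of K mixture types. This is not the paper's model: the function name `chain_max_sum`, the score tables `unary` and `pairwise`, and the restriction to a plain chain are all assumptions made for the example.

```python
def chain_max_sum(unary, pairwise):
    """Generic max-sum dynamic programming over a chain (illustrative only).

    unary[t][k]       hypothetical score for giving key pose t mixture type k
    pairwise[t][j][k] hypothetical compatibility between type j at pose t
                      and type k at pose t + 1
    Returns (best_score, best_types).
    """
    num_steps = len(unary)
    num_types = len(unary[0])

    best = list(unary[0])   # best[k]: best score of any prefix ending in type k
    back = []               # back[t - 1][k]: best predecessor type for k at step t

    for t in range(1, num_steps):
        new_best, pointers = [], []
        for k in range(num_types):
            # Choose the predecessor type that maximizes the running score.
            j_star = max(range(num_types),
                         key=lambda j: best[j] + pairwise[t - 1][j][k])
            new_best.append(best[j_star] + pairwise[t - 1][j_star][k] + unary[t][k])
            pointers.append(j_star)
        best = new_best
        back.append(pointers)

    # Backtrack from the best final type to recover the full assignment.
    k = max(range(num_types), key=lambda i: best[i])
    score = best[k]
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return score, path


# Toy scores: three key poses, two mixture types each (made-up numbers).
unary = [[1, 0], [0, 3], [1, 0]]
pairwise = [[[0, -5], [0, 0]], [[0, 0], [-5, 0]]]
score, types = chain_max_sum(unary, pairwise)
```

The cost is O(T K^2) for T key poses and K types, instead of the K^T cost of enumerating every assignment, which is the usual reason chain- or tree-structured models are paired with dynamic programming.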
Cite
Text
Wang et al. "Video Action Detection with Relational Dynamic-Poselets." European Conference on Computer Vision, 2014. doi:10.1007/978-3-319-10602-1_37
Markdown
[Wang et al. "Video Action Detection with Relational Dynamic-Poselets." European Conference on Computer Vision, 2014.](https://mlanthology.org/eccv/2014/wang2014eccv-video-a/) doi:10.1007/978-3-319-10602-1_37
BibTeX
@inproceedings{wang2014eccv-video-a,
title = {{Video Action Detection with Relational Dynamic-Poselets}},
author = {Wang, Limin and Qiao, Yu and Tang, Xiaoou},
booktitle = {European Conference on Computer Vision},
year = {2014},
pages = {565--580},
doi = {10.1007/978-3-319-10602-1_37},
url = {https://mlanthology.org/eccv/2014/wang2014eccv-video-a/}
}