Combining Per-Frame and Per-Track Cues for Multi-Person Action Recognition

Abstract

We propose a model to combine per-frame and per-track cues for action recognition. With multiple targets in a scene, our model simultaneously captures the natural harmony of an individual’s action in a scene and the flow of actions of an individual in a video sequence, inferring valid tracks in the process. Our motivation is based on the unlikely discordance of an action in a structured scene, both at the track level and the frame level (e.g., a person dancing in a crowd of joggers). While we can utilize sampling approaches for inference in our model, we instead devise a global inference algorithm by decomposing the problem and solving the subproblems exactly and efficiently, recovering a globally optimal joint solution in several cases. Finally, we improve on the state-of-the-art action recognition results for two publicly available datasets.

Cite

Text

Khamis et al. "Combining Per-Frame and Per-Track Cues for Multi-Person Action Recognition." European Conference on Computer Vision, 2012. doi:10.1007/978-3-642-33718-5_9

Markdown

[Khamis et al. "Combining Per-Frame and Per-Track Cues for Multi-Person Action Recognition." European Conference on Computer Vision, 2012.](https://mlanthology.org/eccv/2012/khamis2012eccv-combining/) doi:10.1007/978-3-642-33718-5_9

BibTeX

@inproceedings{khamis2012eccv-combining,
  title     = {{Combining Per-Frame and Per-Track Cues for Multi-Person Action Recognition}},
  author    = {Khamis, Sameh and Morariu, Vlad I. and Davis, Larry S.},
  booktitle = {European Conference on Computer Vision},
  year      = {2012},
  pages     = {116--129},
  doi       = {10.1007/978-3-642-33718-5_9},
  url       = {https://mlanthology.org/eccv/2012/khamis2012eccv-combining/}
}