Action Snippets: How Many Frames Does Human Action Recognition Require?

Abstract

Visual recognition of human actions in video clips has been an active field of research in recent years. However, most published methods either analyse an entire video and assign it a single action label, or use relatively large look-ahead to classify each frame. Contrary to these strategies, human vision proves that simple actions can be recognised almost instantaneously. In this paper, we present a system for action recognition from very short sequences ("snippets") of 1-10 frames, and systematically evaluate it on standard data sets. It turns out that even local shape and optic flow for a single frame are enough to achieve ≈90% correct recognitions, and snippets of 5-7 frames (0.3-0.5 seconds of video) are enough to achieve a performance similar to the one obtainable with the entire video sequence.

Cite

Text

Schindler and Van Gool. "Action Snippets: How Many Frames Does Human Action Recognition Require?" IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2008. doi:10.1109/CVPR.2008.4587730

Markdown

[Schindler and Van Gool. "Action Snippets: How Many Frames Does Human Action Recognition Require?" IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2008.](https://mlanthology.org/cvpr/2008/schindler2008cvpr-action/) doi:10.1109/CVPR.2008.4587730

BibTeX

@inproceedings{schindler2008cvpr-action,
  title     = {{Action Snippets: How Many Frames Does Human Action Recognition Require?}},
  author    = {Schindler, Konrad and Van Gool, Luc},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2008},
  doi       = {10.1109/CVPR.2008.4587730},
  url       = {https://mlanthology.org/cvpr/2008/schindler2008cvpr-action/}
}