Visual Code-Sentences: A New Video Representation Based on Image Descriptor Sequences

Abstract

We present a new descriptor-sequence model for action recognition that enhances discriminative power in the spatio-temporal context while maintaining robustness against background clutter and against inter-/intra-person behavioral variability. We extend the Dense Trajectories framework for activity recognition (Wang et al., 2011) and introduce a pool of dynamic Bayesian networks (e.g., multiple HMMs) over histogram descriptors as codebooks of composite action categories, each anchored at its respective key points. The full set of codebooks, bound to spatio-temporal interest points, constitutes an intermediate feature representation that serves as a basis for generic action categories. This representation is intended to act as visual code-sentences that subsume a rich vocabulary of basis action categories. Through extensive experiments on the KTH, UCF Sports, and Hollywood2 datasets, we demonstrate improvements over state-of-the-art methods.
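To make the idea concrete, below is a minimal Python sketch (not the authors' implementation) of the HMM-pool step the abstract describes: one HMM is fit per basis action category from descriptor sequences, and a new sequence is scored under every model in the pool, so the vector of log-likelihoods plays the role of the intermediate feature representation. It assumes the hmmlearn library; the random arrays stand in for histogram descriptors tracked along dense trajectories, and the helper names fit_codebook and code_sentence_features are hypothetical.

import numpy as np
from hmmlearn import hmm

def fit_codebook(sequences, n_states=3):
    """Fit one Gaussian HMM to a list of descriptor sequences
    (each of shape [T_i, D]) for a single basis action category."""
    X = np.vstack(sequences)               # stack all frames
    lengths = [len(s) for s in sequences]  # per-sequence lengths
    model = hmm.GaussianHMM(n_components=n_states,
                            covariance_type="diag", n_iter=50)
    model.fit(X, lengths)
    return model

def code_sentence_features(models, sequence):
    """Score one descriptor sequence under every HMM in the pool;
    the per-model log-likelihoods form an intermediate feature
    vector over the basis action categories."""
    return np.array([m.score(sequence) for m in models])

# Toy usage: two basis categories, random 8-D descriptor sequences.
rng = np.random.default_rng(0)
seqs_a = [rng.random((20, 8)) for _ in range(5)]
seqs_b = [rng.random((20, 8)) + 1.0 for _ in range(5)]
pool = [fit_codebook(seqs_a), fit_codebook(seqs_b)]
feat = code_sentence_features(pool, rng.random((20, 8)))
print(feat)

In a full system, these per-category scores (computed at spatio-temporal interest points) would be pooled over a video clip and fed to a downstream classifier such as an SVM, rather than used directly as a decision.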

Cite

Text

Mitarai and Matsugu. "Visual Code-Sentences: A New Video Representation Based on Image Descriptor Sequences." European Conference on Computer Vision, 2012. doi:10.1007/978-3-642-33863-2_32

Markdown

[Mitarai and Matsugu. "Visual Code-Sentences: A New Video Representation Based on Image Descriptor Sequences." European Conference on Computer Vision, 2012.](https://mlanthology.org/eccv/2012/mitarai2012eccv-visual/) doi:10.1007/978-3-642-33863-2_32

BibTeX

@inproceedings{mitarai2012eccv-visual,
  title     = {{Visual Code-Sentences: A New Video Representation Based on Image Descriptor Sequences}},
  author    = {Mitarai, Yusuke and Matsugu, Masakazu},
  booktitle = {European Conference on Computer Vision},
  year      = {2012},
  pages     = {321--331},
  doi       = {10.1007/978-3-642-33863-2_32},
  url       = {https://mlanthology.org/eccv/2012/mitarai2012eccv-visual/}
}