Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach

Abstract

We present a compositional model for video event detection. A video is modeled using a collection of both global and segment-level features, and kernel functions are employed for similarity comparisons. The locations of salient, discriminative video segments are treated as a latent variable, allowing the model to explicitly ignore portions of the video that are unimportant for classification. A novel multiple kernel learning (MKL) latent support vector machine (SVM) is defined, which combines and re-weights multiple feature types in a principled fashion while simultaneously operating within the latent variable framework. The compositional nature of the proposed model allows it to respond directly to the challenges of temporal clutter and intra-class variation, which are prevalent in unconstrained internet videos. Experimental results on the TRECVID Multimedia Event Detection 2011 (MED11) dataset demonstrate the efficacy of the method.
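The core idea of the abstract can be illustrated with a minimal scoring sketch: a kernelized SVM decision value is computed for each candidate segment (the latent variable), with per-feature-type kernels combined by learned MKL weights, and the maximum over segments is taken. This is an illustrative simplification, not the authors' implementation; the RBF kernel choice, the feature-type names, and the function names are all assumptions.

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian RBF kernel between two feature vectors (an assumed choice)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def latent_mkl_score(video_segments, support_vectors, alphas, betas, bias=0.0):
    """Score a video by maximizing over the latent segment choice.

    video_segments  : list of dicts, feature type -> vector, one per segment.
    support_vectors : list of dicts with the same feature types.
    alphas          : per-support-vector dual weights (signed by label).
    betas           : per-feature-type kernel weights, as learned by MKL.
    """
    best = -np.inf
    for segment in video_segments:  # latent variable: which segment explains the event
        score = bias
        for sv, alpha in zip(support_vectors, alphas):
            # MKL: convex combination of kernels over feature types
            combined = sum(
                betas[ft] * rbf_kernel(segment[ft], sv[ft]) for ft in betas
            )
            score += alpha * combined
        best = max(best, score)  # max over latent segment locations
    return best
```

Under this sketch, segments that match no support vector contribute low kernel values and are effectively ignored, mirroring the model's ability to discount temporally cluttered portions of a video.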

Cite

Text

Vahdat et al. "Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach." International Conference on Computer Vision, 2013. doi:10.1109/ICCV.2013.463

Markdown

[Vahdat et al. "Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach." International Conference on Computer Vision, 2013.](https://mlanthology.org/iccv/2013/vahdat2013iccv-compositional/) doi:10.1109/ICCV.2013.463

BibTeX

@inproceedings{vahdat2013iccv-compositional,
  title     = {{Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach}},
  author    = {Vahdat, Arash and Cannons, Kevin and Mori, Greg and Oh, Sangmin and Kim, Ilseo},
  booktitle = {International Conference on Computer Vision},
  year      = {2013},
  doi       = {10.1109/ICCV.2013.463},
  url       = {https://mlanthology.org/iccv/2013/vahdat2013iccv-compositional/}
}