Harnessing Object and Scene Semantics for Large-Scale Video Understanding

Wu, Zuxuan; Fu, Yanwei; Jiang, Yu-Gang; Sigal, Leonid

doi:10.1109/CVPR.2016.339

CVPR 2016

Abstract

Large-scale action recognition and video categorization are important problems in computer vision. To address these problems, we propose a novel object- and scene-based semantic fusion network and representation. Our semantic fusion network combines three streams of information using a three-layer neural network: (i) frame-based low-level CNN features, (ii) object features from a state-of-the-art large-scale CNN object detector trained to recognize 20K classes, and (iii) scene features from a state-of-the-art CNN scene detector trained to recognize 205 scenes. The trained network achieves improvements in supervised activity and video categorization on two complex large-scale datasets, ActivityNet and FCVID, respectively. Further, by examining and back-propagating information through the fusion network, semantic relationships (correlations) between video classes and objects/scenes can be discovered. These video class-object and video class-scene relationships can in turn be used as a semantic representation for the video classes themselves. We illustrate the effectiveness of this semantic representation through experiments on zero-shot action/video classification and clustering.
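
The three-stream fusion described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the layer layout (one hidden layer per stream, one shared fusion layer, one softmax output layer), the hidden sizes, the 4096-dimensional frame feature, the 10 output classes, the random weights, and the names `init_fusion_net` and `forward` are all assumptions made for illustration. Only the object (20K, taken here as 20000) and scene (205) dimensions come from the abstract.

```python
import numpy as np

def init_fusion_net(dims, hidden, fused, n_classes, seed=0):
    """Create random weights for a hypothetical three-layer fusion network."""
    rng = np.random.default_rng(seed)
    params = {}
    # Layer 1 (assumed): one projection per input stream.
    for name, d in dims.items():
        params[name] = (rng.normal(0.0, 0.01, (d, hidden)), np.zeros(hidden))
    # Layer 2 (assumed): fusion layer over the concatenated stream outputs.
    params["fusion"] = (
        rng.normal(0.0, 0.01, (hidden * len(dims), fused)),
        np.zeros(fused),
    )
    # Layer 3 (assumed): softmax classifier over video classes.
    params["out"] = (rng.normal(0.0, 0.01, (fused, n_classes)), np.zeros(n_classes))
    return params

def forward(params, feats):
    """Return class probabilities for a batch of per-video stream features."""
    streams = [k for k in params if k not in ("fusion", "out")]
    hidden_outputs = []
    for name in streams:
        W, b = params[name]
        hidden_outputs.append(np.maximum(feats[name] @ W + b, 0.0))  # ReLU
    h = np.concatenate(hidden_outputs, axis=1)
    W, b = params["fusion"]
    h = np.maximum(h @ W + b, 0.0)
    W, b = params["out"]
    logits = h @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

# Stream sizes: object and scene sizes follow the abstract; frame size is assumed.
dims = {"frame": 4096, "object": 20000, "scene": 205}
params = init_fusion_net(dims, hidden=64, fused=32, n_classes=10)

# Two synthetic videos with random stand-in features for each stream.
rng = np.random.default_rng(1)
feats = {name: rng.random((2, d)) for name, d in dims.items()}
probs = forward(params, feats)
print(probs.shape)
```

Projecting each stream separately before concatenation keeps the very wide object stream from dominating the fused representation by sheer dimensionality; whether the paper uses this exact arrangement is not stated in the abstract.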

Cite

Text

Wu et al. "Harnessing Object and Scene Semantics for Large-Scale Video Understanding." Conference on Computer Vision and Pattern Recognition, 2016. doi:10.1109/CVPR.2016.339

Markdown

[Wu et al. "Harnessing Object and Scene Semantics for Large-Scale Video Understanding." Conference on Computer Vision and Pattern Recognition, 2016.](https://mlanthology.org/cvpr/2016/wu2016cvpr-harnessing/) doi:10.1109/CVPR.2016.339

BibTeX

@inproceedings{wu2016cvpr-harnessing,
  title     = {{Harnessing Object and Scene Semantics for Large-Scale Video Understanding}},
  author    = {Wu, Zuxuan and Fu, Yanwei and Jiang, Yu-Gang and Sigal, Leonid},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2016},
  doi       = {10.1109/CVPR.2016.339},
  url       = {https://mlanthology.org/cvpr/2016/wu2016cvpr-harnessing/}
}