Objects2action: Classifying and Localizing Actions Without Any Video Example

Abstract

The goal of this paper is to recognize actions in video without the need for examples. Different from traditional zero-shot approaches, we do not demand the design and specification of attribute classifiers and class-to-attribute mappings to allow for transfer from seen classes to unseen classes. Our key contribution is objects2action, a semantic word embedding that is spanned by a skip-gram model of thousands of object categories. Action labels are assigned to an object encoding of unseen video based on a convex combination of action and object affinities. Our semantic embedding has three main characteristics to accommodate the specifics of actions. First, we propose a mechanism to exploit multiple-word descriptions of actions and objects. Second, we incorporate the automated selection of the most responsive objects per action. Finally, we demonstrate how to extend our zero-shot approach to the spatio-temporal localization of actions in video. Experiments on four action datasets demonstrate the potential of our approach.
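The core idea in the abstract, scoring an unseen action as a convex combination of a video's object-classifier responses weighted by word-embedding affinities, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the plain-dict word-vector lookup, the cosine affinity, the averaging of multi-word descriptions, and the top-T cutoff are all illustrative choices.

```python
# Sketch of an objects2action-style zero-shot scorer. Assumptions (not from the paper):
#  - `word_vectors` is a dict mapping a word to its skip-gram embedding (numpy array),
#  - `object_scores` is one video's vector of object-classifier responses,
#  - top_t and all names below are hypothetical and chosen for readability.
import numpy as np

def embed(description, word_vectors):
    """Average and L2-normalize the word vectors of a (possibly multi-word) description."""
    vecs = [word_vectors[w] for w in description.lower().split() if w in word_vectors]
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)

def action_object_affinity(action_names, object_names, word_vectors):
    """Cosine affinities between every action description and every object description."""
    A = np.stack([embed(a, word_vectors) for a in action_names])  # (n_actions, d)
    O = np.stack([embed(o, word_vectors) for o in object_names])  # (n_objects, d)
    return A @ O.T                                                # (n_actions, n_objects)

def zero_shot_action_scores(object_scores, affinity, top_t=100):
    """Score each unseen action as a convex combination of the video's object
    responses, keeping only the top-T most responsive objects per action."""
    scores = np.zeros(affinity.shape[0])
    for z, g in enumerate(affinity):
        keep = np.argsort(g)[-top_t:]        # most responsive objects for action z
        w = np.maximum(g[keep], 0.0)
        w = w / (w.sum() + 1e-12)            # non-negative weights that sum to one
        scores[z] = w @ object_scores[keep]  # weighted object evidence for action z
    return scores
```

Classification of an unlabeled video would then amount to taking the argmax over `zero_shot_action_scores`; applying the same scoring to object encodings of spatio-temporal proposals rather than whole videos is one way to extend the idea toward localization, as the abstract describes.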

Cite

Text

Jain et al. "Objects2action: Classifying and Localizing Actions Without Any Video Example." International Conference on Computer Vision, 2015. doi:10.1109/ICCV.2015.521

Markdown

[Jain et al. "Objects2action: Classifying and Localizing Actions Without Any Video Example." International Conference on Computer Vision, 2015.](https://mlanthology.org/iccv/2015/jain2015iccv-objects2action/) doi:10.1109/ICCV.2015.521

BibTeX

@inproceedings{jain2015iccv-objects2action,
  title     = {{Objects2action: Classifying and Localizing Actions Without Any Video Example}},
  author    = {Jain, Mihir and van Gemert, Jan C. and Mensink, Thomas and Snoek, Cees G. M.},
  booktitle = {International Conference on Computer Vision},
  year      = {2015},
  doi       = {10.1109/ICCV.2015.521},
  url       = {https://mlanthology.org/iccv/2015/jain2015iccv-objects2action/}
}