Learning to Track for Spatio-Temporal Action Localization

Abstract

We propose an effective approach for spatio-temporal action localization in realistic videos. The approach first detects proposals at the frame level and scores them with a combination of static and motion CNN features. It then tracks high-scoring proposals throughout the video using a tracking-by-detection approach. Our tracker relies simultaneously on instance-level and class-level detectors. The tracks are scored using a spatio-temporal motion histogram, a descriptor at the track level, in combination with the CNN features. Finally, we perform temporal localization of the action using a sliding-window approach at the track level. We present experimental results for spatio-temporal localization on the UCF-Sports, J-HMDB and UCF-101 action localization datasets, where our approach outperforms the state of the art by margins of 15%, 7% and 12% in mAP, respectively.
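
The final step of the pipeline, sliding-window temporal localization at the track level, can be illustrated with a minimal Python sketch. This is an assumption-laden illustration, not the authors' implementation: it assumes each track comes with a per-frame score (e.g. combining CNN and motion-histogram scores), scores each candidate window by its mean, and keeps the best window; the function name, window lengths and stride are hypothetical.

# Hypothetical sketch of sliding-window temporal localization over a track's
# per-frame scores. All names and parameter values are illustrative.
from typing import List, Tuple

def localize_action(frame_scores: List[float],
                    window_lengths: Tuple[int, ...] = (20, 40, 60),
                    stride: int = 5) -> Tuple[int, int, float]:
    """Return (start, end, score) of the best-scoring temporal window."""
    best = (0, len(frame_scores), float("-inf"))
    for length in window_lengths:
        if length > len(frame_scores):
            continue
        for start in range(0, len(frame_scores) - length + 1, stride):
            window = frame_scores[start:start + length]
            score = sum(window) / length  # mean detection score over the window
            if score > best[2]:
                best = (start, start + length, score)
    return best

# Usage: per-frame scores of one track; the action is localized to the
# highest-scoring window among the candidate lengths.
scores = [0.1, 0.2, 0.8, 0.9, 0.85, 0.9, 0.3, 0.1] * 10
start, end, score = localize_action(scores, window_lengths=(10, 20), stride=2)
print(start, end, round(score, 3))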

Cite

Text

Weinzaepfel et al. "Learning to Track for Spatio-Temporal Action Localization." International Conference on Computer Vision, 2015. doi:10.1109/ICCV.2015.362

Markdown

[Weinzaepfel et al. "Learning to Track for Spatio-Temporal Action Localization." International Conference on Computer Vision, 2015.](https://mlanthology.org/iccv/2015/weinzaepfel2015iccv-learning/) doi:10.1109/ICCV.2015.362

BibTeX

@inproceedings{weinzaepfel2015iccv-learning,
  title     = {{Learning to Track for Spatio-Temporal Action Localization}},
  author    = {Weinzaepfel, Philippe and Harchaoui, Zaid and Schmid, Cordelia},
  booktitle = {International Conference on Computer Vision},
  year      = {2015},
  doi       = {10.1109/ICCV.2015.362},
  url       = {https://mlanthology.org/iccv/2015/weinzaepfel2015iccv-learning/}
}