Multi-Region Two-Stream R-CNN for Action Detection

Abstract

We propose a multi-region two-stream R-CNN model for action detection in realistic videos. We start from frame-level action detection based on faster R-CNN, and make three contributions: (1) we show that a motion region proposal network generates high-quality proposals, which are complementary to those of an appearance region proposal network; (2) we show that stacking optical flow over several frames significantly improves frame-level action detection; and (3) we embed a multi-region scheme in the faster R-CNN model, which adds complementary information on body parts. We then link frame-level detections with the Viterbi algorithm, and temporally localize an action with the maximum subarray method. Experimental results on the UCF-Sports, J-HMDB and UCF101 action detection datasets show that our approach outperforms the state of the art by a significant margin in both frame-mAP and video-mAP.
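The temporal localization step can be sketched with Kadane's maximum-subarray algorithm: given per-frame detection scores along a linked track, find the contiguous frame range with the largest total score. This is a minimal illustration, not the authors' implementation; the scores, the offset parameter, and the function name are hypothetical. Subtracting an offset from each score makes low-confidence frames contribute negatively, so the best subarray trims them off.

```python
def temporal_localization(frame_scores, offset=0.0):
    """Return the (start, end) frame range maximizing the sum of
    (score - offset), via Kadane's maximum-subarray algorithm.

    frame_scores: per-frame detection scores along a linked track.
    offset: illustrative threshold; frames scoring below it push
    the running sum down and tend to be excluded.
    """
    best_sum = float("-inf")
    best_range = (0, 0)
    cur_sum = 0.0
    cur_start = 0
    for i, score in enumerate(frame_scores):
        value = score - offset
        if cur_sum <= 0:
            # Restart the candidate subarray at frame i.
            cur_sum = value
            cur_start = i
        else:
            cur_sum += value
        if cur_sum > best_sum:
            best_sum = cur_sum
            best_range = (cur_start, i)
    return best_range, best_sum
```

For example, with scores `[0.1, 0.2, 0.9, 0.8, 0.85, 0.1, 0.05]` and an offset of 0.5, the selected range is frames 2 through 4, where the action scores are high.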

Cite

Text

Peng and Schmid. "Multi-Region Two-Stream R-CNN for Action Detection." European Conference on Computer Vision, 2016. doi:10.1007/978-3-319-46493-0_45

Markdown

[Peng and Schmid. "Multi-Region Two-Stream R-CNN for Action Detection." European Conference on Computer Vision, 2016.](https://mlanthology.org/eccv/2016/peng2016eccv-multi/) doi:10.1007/978-3-319-46493-0_45

BibTeX

@inproceedings{peng2016eccv-multi,
  title     = {{Multi-Region Two-Stream R-CNN for Action Detection}},
  author    = {Peng, Xiaojiang and Schmid, Cordelia},
  booktitle = {European Conference on Computer Vision},
  year      = {2016},
  pages     = {744--759},
  doi       = {10.1007/978-3-319-46493-0_45},
  url       = {https://mlanthology.org/eccv/2016/peng2016eccv-multi/}
}