Videos as Space-Time Region Graphs

Abstract

How do humans recognize the action "opening a book"? We argue that there are two important cues: modeling temporal shape dynamics and modeling functional relationships between humans and objects. In this paper, we propose to represent videos as space-time region graphs which capture these two important cues. Our graph nodes are defined by the object region proposals from different frames in a long range video. These nodes are connected by two types of relations: (i) similarity relations capturing the long range dependencies between correlated objects and (ii) spatial-temporal relations capturing the interactions between nearby objects. We perform reasoning on this graph representation via Graph Convolutional Networks. We achieve state-of-the-art results on the Charades and Something-Something datasets. Especially for Charades, we obtain a huge 4.4% gain when our model is applied in complex environments.
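To make the graph-reasoning step concrete, below is a minimal sketch (not the authors' released code) of one graph-convolution update over region-proposal features using the similarity relations described in the abstract: pairwise affinities between node features, row-normalized with a softmax, followed by the standard GCN transform Z = AXW. The feature dimension, number of proposals, weight initialization, and the final mean pooling are illustrative assumptions, and the spatial-temporal relation graph is omitted for brevity.

```python
# Minimal sketch of similarity-graph reasoning over region features (assumed
# shapes and pooling; not the authors' implementation).
import torch
import torch.nn.functional as F

def similarity_gcn_layer(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """One graph-convolution step over N region nodes with d-dim features.

    x:      (N, d) region proposal features pooled from video frames
    weight: (d, d) learnable GCN weight matrix
    """
    # Similarity relations: pairwise dot-product affinities between all nodes,
    # softmax-normalized per row so each node's incoming edge weights sum to 1.
    affinity = x @ x.t()                   # (N, N)
    adjacency = F.softmax(affinity, dim=1)
    # Graph convolution: aggregate neighbor features, then apply the transform.
    return F.relu(adjacency @ x @ weight)  # (N, d)

# Toy usage: 50 region proposals with 1024-d features (illustrative sizes).
torch.manual_seed(0)
regions = torch.randn(50, 1024)
w = torch.randn(1024, 1024) * 0.01
updated = similarity_gcn_layer(regions, w)
video_feature = updated.mean(dim=0)        # pool node features into one video-level vector
print(video_feature.shape)                 # torch.Size([1024])
```

In the paper, the output of such graph reasoning is combined with global video features before classification; the pooling and classifier details here are placeholders.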

Cite

Text

Wang and Gupta. "Videos as Space-Time Region Graphs." Proceedings of the European Conference on Computer Vision (ECCV), 2018. doi:10.1007/978-3-030-01228-1_25

Markdown

[Wang and Gupta. "Videos as Space-Time Region Graphs." Proceedings of the European Conference on Computer Vision (ECCV), 2018.](https://mlanthology.org/eccv/2018/wang2018eccv-videos/) doi:10.1007/978-3-030-01228-1_25

BibTeX

@inproceedings{wang2018eccv-videos,
  title     = {{Videos as Space-Time Region Graphs}},
  author    = {Wang, Xiaolong and Gupta, Abhinav},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2018},
  doi       = {10.1007/978-3-030-01228-1_25},
  url       = {https://mlanthology.org/eccv/2018/wang2018eccv-videos/}
}