End-to-End Joint Semantic Segmentation of Actors and Actions in Video

Abstract

Traditional video understanding tasks include human action recognition and actor/object semantic segmentation. However, the combined task of semantically segmenting different actor classes simultaneously with their action classes remains challenging but necessary for many applications. In this work, we propose a new end-to-end architecture for tackling this task in videos. Our model effectively leverages multiple input modalities, contextual information, and multitask learning to directly output semantic segmentations in a single unified framework. We train and benchmark our model on the Actor-Action Dataset (A2D) for joint actor-action semantic segmentation, and demonstrate state-of-the-art performance for both segmentation and detection. We also perform experiments verifying that our approach improves performance for zero-shot recognition, indicating the generalizability of our jointly learned feature space.
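To make the task concrete, below is a minimal sketch of joint actor-action segmentation as multitask learning: a shared encoder feeds two per-pixel classification heads, one over actor classes and one over action classes, trained with a summed cross-entropy loss. This is an illustrative assumption, not the paper's architecture; the layer sizes, class counts (here roughly matching A2D's 7 actors and 9 actions plus background), and names are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointActorActionSegmenter(nn.Module):
    """Hypothetical two-head multitask segmentation network (not the
    authors' exact model): a shared encoder feeds separate per-pixel
    actor and action classifiers."""

    def __init__(self, in_channels=3, num_actors=8, num_actions=10):
        super().__init__()
        # Shared convolutional trunk (stand-in for a real backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        # Task-specific 1x1 heads produce per-pixel class logits.
        self.actor_head = nn.Conv2d(64, num_actors, 1)
        self.action_head = nn.Conv2d(64, num_actions, 1)

    def forward(self, x):
        feats = self.encoder(x)
        return self.actor_head(feats), self.action_head(feats)

model = JointActorActionSegmenter()
frames = torch.randn(2, 3, 128, 128)            # batch of RGB frames
actor_logits, action_logits = model(frames)     # (2,8,128,128), (2,10,128,128)

# Multitask training sums the per-pixel cross-entropy of both tasks.
actor_gt = torch.randint(0, 8, (2, 128, 128))   # dummy per-pixel actor labels
action_gt = torch.randint(0, 10, (2, 128, 128)) # dummy per-pixel action labels
loss = (F.cross_entropy(actor_logits, actor_gt)
        + F.cross_entropy(action_logits, action_gt))

The paper's actual model additionally exploits multiple input modalities and temporal context from video; the sketch only illustrates the shared-feature, two-head multitask structure that "joint actor-action segmentation" implies.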

Cite

Text

Ji et al. "End-to-End Joint Semantic Segmentation of Actors and Actions in Video." Proceedings of the European Conference on Computer Vision (ECCV), 2018. doi:10.1007/978-3-030-01225-0_43

Markdown

[Ji et al. "End-to-End Joint Semantic Segmentation of Actors and Actions in Video." Proceedings of the European Conference on Computer Vision (ECCV), 2018.](https://mlanthology.org/eccv/2018/ji2018eccv-endtoend/) doi:10.1007/978-3-030-01225-0_43

BibTeX

@inproceedings{ji2018eccv-endtoend,
  title     = {{End-to-End Joint Semantic Segmentation of Actors and Actions in Video}},
  author    = {Ji, Jingwei and Buch, Shyamal and Soto, Alvaro and Niebles, Juan Carlos},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2018},
  doi       = {10.1007/978-3-030-01225-0_43},
  url       = {https://mlanthology.org/eccv/2018/ji2018eccv-endtoend/}
}