Modeling Multi-Label Action Dependencies for Temporal Action Localization

Abstract

Real-world videos contain many complex actions with inherent relationships between action classes. In this work, we propose an attention-based architecture that models these action relationships for the task of temporal action localization in untrimmed videos. Unlike previous works, which leverage video-level co-occurrence of actions, we distinguish between the relationships among actions that occur at the same time-step and those among actions that occur at different time-steps (i.e., those which precede or follow each other). We define these distinct relationships as action dependencies. We propose to improve action localization performance by modeling these action dependencies in a novel attention-based Multi-Label Action Dependency (MLAD) layer. The MLAD layer consists of two branches: a Co-occurrence Dependency Branch and a Temporal Dependency Branch, which model co-occurrence action dependencies and temporal action dependencies, respectively. We observe that existing metrics used for multi-label classification do not explicitly measure how well action dependencies are modeled; therefore, we propose novel metrics which consider both co-occurrence and temporal dependencies between action classes. Through empirical evaluation and extensive analysis, we show improved performance over state-of-the-art methods on multi-label action localization benchmarks (MultiTHUMOS and Charades) in terms of f-mAP and our proposed metric.
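
The two-branch idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes per-class features at every time-step, uses plain scaled dot-product self-attention with identity query/key/value projections (no learned weights), and merges the two branches by simple averaging. The function names `self_attention` and `mlad_like_layer` and the averaging step are hypothetical choices made only for illustration; the abstract does not specify these details.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V (an assumption).

    tokens: list of n feature vectors, each of length d.
    Returns n vectors, each a softmax-weighted mix of the inputs.
    """
    d = len(tokens[0])
    out = []
    for query in tokens:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
            for key in tokens
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)]
        )
    return out


def mlad_like_layer(feats):
    """Hypothetical two-branch layer; feats[t][c] is the feature of class c at time t."""
    num_steps, num_classes = len(feats), len(feats[0])

    # Co-occurrence branch: at each time-step, attend across action classes.
    co = [self_attention(feats[t]) for t in range(num_steps)]

    # Temporal branch: for each class, attend across time-steps.
    per_class = [
        self_attention([feats[t][c] for t in range(num_steps)])
        for c in range(num_classes)
    ]
    # per_class is indexed [c][t]; transpose back to [t][c].
    temporal = [
        [per_class[c][t] for c in range(num_classes)] for t in range(num_steps)
    ]

    # Assumed merge rule: element-wise average of the two branch outputs.
    return [
        [
            [0.5 * (a + b) for a, b in zip(co[t][c], temporal[t][c])]
            for c in range(num_classes)
        ]
        for t in range(num_steps)
    ]
```

Calling `mlad_like_layer` on a nested list of shape `[time-steps][classes][feature-dim]` returns a list of the same shape, in which each class feature has been mixed with the other classes at the same time-step and with the same class at other time-steps.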

Cite

Text

Tirupattur et al. "Modeling Multi-Label Action Dependencies for Temporal Action Localization." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.00151

Markdown

[Tirupattur et al. "Modeling Multi-Label Action Dependencies for Temporal Action Localization." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/tirupattur2021cvpr-modeling/) doi:10.1109/CVPR46437.2021.00151

BibTeX

@inproceedings{tirupattur2021cvpr-modeling,
  title     = {{Modeling Multi-Label Action Dependencies for Temporal Action Localization}},
  author    = {Tirupattur, Praveen and Duarte, Kevin and Rawat, Yogesh S and Shah, Mubarak},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2021},
  pages     = {1460-1470},
  doi       = {10.1109/CVPR46437.2021.00151},
  url       = {https://mlanthology.org/cvpr/2021/tirupattur2021cvpr-modeling/}
}