Enriching Local and Global Contexts for Temporal Action Localization

Abstract

Effectively tackling the problem of temporal action localization (TAL) necessitates a visual representation that jointly pursues two confounding goals, i.e., fine-grained discrimination for temporal localization and sufficient visual invariance for action classification. We address this challenge by enriching both the local and global contexts in the popular two-stage temporal localization framework, where action proposals are first generated followed by action classification and temporal boundary regression. Our proposed model, dubbed ContextLoc, can be divided into three sub-networks: L-Net, G-Net and P-Net. L-Net enriches the local context via fine-grained modeling of snippet-level features, which is formulated as a query-and-retrieval process. G-Net enriches the global context via higher-level modeling of the video-level representation. In addition, we introduce a novel context adaptation module to adapt the global context to different proposals. P-Net further models the context-aware inter-proposal relations. We explore two existing models to be the P-Net in our experiments. The efficacy of our proposed method is validated by experimental results on the THUMOS14 (54.3% at tIoU@0.5) and ActivityNet v1.3 (56.01% at tIoU@0.5) datasets, where it outperforms recent state-of-the-art methods. Code is available at https://github.com/buxiangzhiren/ContextLoc.
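The abstract describes L-Net only as a "query-and-retrieval process" over snippet-level features. As a rough illustration of what such a process could look like, the sketch below treats one proposal feature as the query and the snippet features as keys and values. The scaled dot-product similarity, the softmax weighting, the residual addition, and the name `enrich_local_context` are all assumptions made for illustration; they are not taken from the paper or the released code.

```python
import math


def enrich_local_context(proposal_feat, snippet_feats):
    """Hypothetical sketch: attend from one proposal feature to its snippets.

    proposal_feat: list of d floats, used as the query.
    snippet_feats: list of n snippet features (each a list of d floats),
                   used as both keys and values.
    Returns (enriched proposal feature, attention weights).
    """
    d = len(proposal_feat)

    # Query step: scaled dot-product similarity between the proposal
    # and every snippet (an assumed similarity measure).
    scores = [
        sum(q * k for q, k in zip(proposal_feat, snippet)) / math.sqrt(d)
        for snippet in snippet_feats
    ]

    # Softmax over snippets, shifted by the max for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Retrieval step: attention-weighted sum of the snippet features.
    retrieved = [
        sum(w * snippet[j] for w, snippet in zip(weights, snippet_feats))
        for j in range(d)
    ]

    # Residual addition (an assumption) keeps the original proposal feature.
    enriched = [p + r for p, r in zip(proposal_feat, retrieved)]
    return enriched, weights


# Toy example: two snippets match the proposal, one does not.
proposal = [1.0, 0.0]
snippets = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
enriched, weights = enrich_local_context(proposal, snippets)
```

In this toy input the two snippets aligned with the proposal receive equal weights that are larger than the weight of the mismatched snippet, so the retrieved context is dominated by the snippets most similar to the query.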

Cite

Text

Zhu et al. "Enriching Local and Global Contexts for Temporal Action Localization." International Conference on Computer Vision, 2021. doi:10.1109/ICCV48922.2021.01326

Markdown

[Zhu et al. "Enriching Local and Global Contexts for Temporal Action Localization." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/zhu2021iccv-enriching/) doi:10.1109/ICCV48922.2021.01326

BibTeX

@inproceedings{zhu2021iccv-enriching,
  title     = {{Enriching Local and Global Contexts for Temporal Action Localization}},
  author    = {Zhu, Zixin and Tang, Wei and Wang, Le and Zheng, Nanning and Hua, Gang},
  booktitle = {International Conference on Computer Vision},
  year      = {2021},
  pages     = {13516--13525},
  doi       = {10.1109/ICCV48922.2021.01326},
  url       = {https://mlanthology.org/iccv/2021/zhu2021iccv-enriching/}
}