A Sliding Window Scheme for Online Temporal Action Localization

Abstract

Most online video understanding tasks aim to process each streaming frame immediately and output predictions frame by frame. Online Temporal Action Localization (On-TAL) has recently been proposed to extend existing online video tasks to instance-level predictions. However, simple On-TAL approaches that group per-frame predictions are limited by their lack of instance-level context. To address this, we propose the Online Anchor Transformer (OAT), which extends the anchor-based action localization model to the online setting. We also introduce an online-applicable post-processing method that suppresses repetitive action proposals. Evaluations of On-TAL on THUMOS’14, MUSES, and BBDB show significant improvements in mAP, and with a minor change to the post-processing method our model performs comparably to state-of-the-art offline TAL methods. In addition to mAP, we present a new online-oriented metric of early detection for On-TAL and use it to measure the responsiveness of each On-TAL approach.
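The abstract does not specify how repetitive proposals are suppressed, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a simple rule: a new proposal is dropped when it overlaps an already emitted proposal of the same class by at least a temporal IoU threshold. The names `Proposal`, `temporal_iou`, and `OnlineSuppressor`, and the 0.5 threshold, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Proposal:
    """A hypothetical action proposal: time span in seconds, class label, confidence."""
    start: float
    end: float
    label: str
    score: float


def temporal_iou(a: Proposal, b: Proposal) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class OnlineSuppressor:
    """Keeps only proposals that do not repeat an earlier emitted one.

    Assumption: unlike offline NMS, an emitted proposal is never retracted,
    so each new proposal is compared only against past outputs.
    """
    iou_threshold: float = 0.5
    emitted: List[Proposal] = field(default_factory=list)

    def push(self, proposal: Proposal) -> bool:
        """Return True if the proposal is emitted, False if it is suppressed."""
        for kept in self.emitted:
            same_class = kept.label == proposal.label
            if same_class and temporal_iou(kept, proposal) >= self.iou_threshold:
                return False
        self.emitted.append(proposal)
        return True
```

In a streaming loop, `push` would be called once per proposal produced at the current window position, and only proposals for which it returns `True` would be reported. Because past outputs cannot be revised, this greedy rule favors the earliest proposal over the highest-scoring one, which is one way an online variant differs from offline NMS.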

Cite

Text

Kim et al. "A Sliding Window Scheme for Online Temporal Action Localization." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19830-4

Markdown

[Kim et al. "A Sliding Window Scheme for Online Temporal Action Localization." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/kim2022eccv-sliding/) doi:10.1007/978-3-031-19830-4

BibTeX

@inproceedings{kim2022eccv-sliding,
  title     = {{A Sliding Window Scheme for Online Temporal Action Localization}},
  author    = {Kim, Young Hwi and Kang, Hyolim and Kim, Seon Joo},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-19830-4},
  url       = {https://mlanthology.org/eccv/2022/kim2022eccv-sliding/}
}