E2E-LOAD: End-to-End Long-Form Online Action Detection
Abstract
Recently, feature-based methods for Online Action Detection (OAD) have been gaining traction. However, these methods are constrained by their fixed backbone design, which fails to leverage the potential benefits of a trainable backbone. This paper introduces an end-to-end learning network that revisits these approaches with a backbone design that improves both effectiveness and efficiency. Our proposed model applies a shared initial spatial model to all frames and maintains a long sequence cache, which enables low-cost inference. We advocate an asymmetric spatiotemporal model that caters to long-form and short-form modeling, respectively. Additionally, we propose a novel and efficient inference mechanism that accelerates extensive spatiotemporal exploration. Through comprehensive ablation studies and experiments, we validate the performance and efficiency of our proposed method. Remarkably, we achieve end-to-end OAD at 17.3 (+12.6) FPS with 72.4% (+1.2%), 90.3% (+0.7%), and 48.1% (+26.0%) mAP on THUMOS'14, TVSeries, and HDD, respectively.
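The abstract outlines a streaming design: a shared spatial model encodes each incoming frame once, the resulting features enter a long sequence cache, and asymmetric long-form and short-form temporal models consume that cache to score the current frame. The sketch below illustrates this caching pattern in PyTorch; the module choices, dimensions, and names (e.g., StreamingOADSketch, step) are our own illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the cached streaming-inference pattern described in
# the abstract: a shared spatial encoder runs once per incoming frame, its
# features are appended to a long sequence cache, and an asymmetric pair of
# temporal models (long-form over the full cache, short-form over recent
# frames) produces per-frame action scores. All shapes and names are
# illustrative, not the authors' implementation.
from collections import deque

import torch
import torch.nn as nn


class StreamingOADSketch(nn.Module):
    def __init__(self, feat_dim=256, num_classes=21, long_len=512, short_len=8):
        super().__init__()
        # Shared spatial model: applied independently to every frame, so each
        # frame's features are computed exactly once and then reused.
        self.spatial = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Asymmetric temporal models: a lighter long-form encoder over the
        # cached history and a heavier short-form encoder over recent frames.
        self.long_term = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(feat_dim, nhead=4, batch_first=True),
            num_layers=1,
        )
        self.short_term = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(feat_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(2 * feat_dim, num_classes)
        self.cache = deque(maxlen=long_len)  # long sequence cache
        self.short_len = short_len

    @torch.no_grad()
    def step(self, frame):
        """Process one incoming frame of shape (3, H, W); return logits."""
        feat = self.spatial(frame.unsqueeze(0))  # (1, feat_dim)
        self.cache.append(feat)                  # extend the cached history
        hist = torch.cat(list(self.cache), dim=0).unsqueeze(0)  # (1, T, D)
        long_ctx = self.long_term(hist)[:, -1]                  # full history
        short_ctx = self.short_term(hist[:, -self.short_len:])[:, -1]
        return self.head(torch.cat([long_ctx, short_ctx], dim=-1))


model = StreamingOADSketch().eval()
for t in range(16):  # simulate a live stream, one frame at a time
    logits = model.step(torch.randn(3, 224, 224))
print(logits.shape)  # torch.Size([1, 21])
```

Under these assumptions, the cache is what makes online inference cheap: each frame passes through the spatial model once, and only the lightweight temporal models rerun over the stored features at every step.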
Cite
Text
Cao et al. "E2E-LOAD: End-to-End Long-Form Online Action Detection." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00956
Markdown
[Cao et al. "E2E-LOAD: End-to-End Long-Form Online Action Detection." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/cao2023iccv-e2eload/) doi:10.1109/ICCV51070.2023.00956
BibTeX
@inproceedings{cao2023iccv-e2eload,
title = {{E2E-LOAD: End-to-End Long-Form Online Action Detection}},
author = {Cao, Shuqiang and Luo, Weixin and Wang, Bairui and Zhang, Wei and Ma, Lin},
booktitle = {International Conference on Computer Vision},
year = {2023},
pages = {10422--10432},
doi = {10.1109/ICCV51070.2023.00956},
url = {https://mlanthology.org/iccv/2023/cao2023iccv-e2eload/}
}