Adapting Short-Term Transformers for Action Detection in Untrimmed Videos
Abstract
Vision Transformer (ViT) has shown high potential in video recognition, owing to its flexible design, adaptable self-attention mechanisms, and the efficacy of masked pre-training. Yet it remains unclear how to adapt these pre-trained short-term ViTs for temporal action detection (TAD) in untrimmed videos. Existing works treat them as off-the-shelf feature extractors for each short trimmed snippet, without capturing the fine-grained relations among different snippets in a broader temporal context. To mitigate this issue, this paper focuses on designing a new mechanism for adapting these pre-trained ViT models as a unified long-form video transformer, to fully unleash its modeling power in capturing inter-snippet relations while still keeping low computation overhead and memory consumption for efficient TAD. To this end, we design effective cross-snippet propagation modules to gradually exchange short-term video information among different snippets at two levels. For inner-backbone information propagation, we introduce a cross-snippet propagation strategy to enable multi-snippet temporal feature interaction inside the backbone. For post-backbone information propagation, we propose temporal transformer layers for further clip-level modeling. With the plain ViT-B pre-trained with VideoMAE, our end-to-end temporal action detector (ViT-TAD) yields very competitive performance compared to previous temporal action detectors, reaching up to 69.5 average mAP on THUMOS14, 37.40 average mAP on ActivityNet-1.3, and 17.20 average mAP on FineAction.
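The abstract names two propagation levels but gives no implementation details. The following PyTorch sketch is purely illustrative of the general idea: the module names `CrossSnippetPropagation` and `PostBackboneTemporalTransformer`, the `(B, S, T, C)` tensor layout, and all hyperparameters are hypothetical placeholders of my choosing, not the authors' ViT-TAD implementation.

```python
# Illustrative sketch only -- NOT the authors' ViT-TAD code.
# Assumes snippet features of shape (B, S, T, C): B videos, S snippets,
# T frame tokens per snippet, C channels (e.g. 768 for ViT-B).
import torch
import torch.nn as nn

class CrossSnippetPropagation(nn.Module):
    """Hypothetical inner-backbone module: lets snippets exchange
    information via self-attention over the snippet axis, imagined as
    being inserted between short-term ViT blocks."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, t, c = x.shape
        # Attend across the S snippets at each token position.
        y = x.permute(0, 2, 1, 3).reshape(b * t, s, c)
        y = self.norm(y)
        y, _ = self.attn(y, y, y)
        y = y.reshape(b, t, s, c).permute(0, 2, 1, 3)
        return x + y  # residual: add inter-snippet context

class PostBackboneTemporalTransformer(nn.Module):
    """Hypothetical post-backbone stage: transformer encoder layers
    over the clip-level sequence of pooled snippet features."""
    def __init__(self, dim: int, depth: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        clip = x.mean(dim=2)       # (B, S, C): one feature per snippet
        return self.encoder(clip)  # clip-level temporal modeling

# Toy usage with features from a short-term ViT backbone.
feats = torch.randn(1, 16, 8, 768)  # 16 snippets, 8 tokens each
feats = CrossSnippetPropagation(768)(feats)
clip_feats = PostBackboneTemporalTransformer(768)(feats)  # (1, 16, 768)
```

In this sketch, inter-snippet attention runs only along the snippet axis, leaving per-snippet computation unchanged, which loosely mirrors the stated goal of low extra overhead; the paper's actual design may differ substantially.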
Cite
Text
Yang et al. "Adapting Short-Term Transformers for Action Detection in Untrimmed Videos." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01757
Markdown
[Yang et al. "Adapting Short-Term Transformers for Action Detection in Untrimmed Videos." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/yang2024cvpr-adapting/) doi:10.1109/CVPR52733.2024.01757
BibTeX
@inproceedings{yang2024cvpr-adapting,
title = {{Adapting Short-Term Transformers for Action Detection in Untrimmed Videos}},
author = {Yang, Min and Gao, Huan and Guo, Ping and Wang, Limin},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {18570-18579},
doi = {10.1109/CVPR52733.2024.01757},
url = {https://mlanthology.org/cvpr/2024/yang2024cvpr-adapting/}
}