Weakly-Supervised Temporal Action Localization with Multi-Modal Plateau Transformers
Abstract
Weakly-Supervised Temporal Action Localization (WS-TAL) aims to jointly localize and classify action segments in untrimmed videos using only video-level annotations. To leverage such annotations, most existing methods adopt the multiple-instance learning paradigm, in which frame-/snippet-level action predictions are first produced and then aggregated into a video-level prediction. Although prior work has attempted to improve snippet-level predictions by modeling temporal relationships, we argue that this information has not been sufficiently exploited. In this paper, we propose Multi-Modal Plateau Transformers (M2PT) for WS-TAL, which simultaneously exploit temporal relationships among snippets, complementary information across data modalities, and temporal coherence among consecutive snippets. Specifically, M2PT adopts a dual-Transformer architecture for the RGB and optical-flow modalities, modeling intra-modality temporal relationships with a self-attention mechanism and inter-modality temporal relationships with a cross-attention mechanism. To capture the temporal coherence whereby consecutive snippets should be assigned the same action, M2PT deploys a Plateau model to refine the temporal localization of action segments. Experimental results on popular benchmarks demonstrate that our proposed M2PT achieves state-of-the-art performance.
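The abstract's two key ingredients can be illustrated with a minimal sketch: intra-modality self-attention and inter-modality cross-attention over snippet features from two streams, plus a smooth "plateau" window that favors assigning consecutive snippets the same action. This is a simplified illustration, not the authors' implementation: the function names, the single-head unprojected attention, and the sigmoid-product plateau parameterization (`start`, `end`, `sharpness`) are all assumptions for exposition.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention over T snippets (single head, no
    # learned projections -- a deliberate simplification).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def dual_stream_step(rgb, flow):
    # Intra-modality temporal modeling via self-attention,
    # then inter-modality exchange via cross-attention:
    # each stream queries the other modality's refined features.
    rgb_sa = attention(rgb, rgb, rgb)
    flow_sa = attention(flow, flow, flow)
    rgb_out = attention(rgb_sa, flow_sa, flow_sa)    # RGB queries flow
    flow_out = attention(flow_sa, rgb_sa, rgb_sa)    # flow queries RGB
    return rgb_out, flow_out

def plateau(t, start, end, sharpness=4.0):
    # Smooth flat-top temporal window: ~1 inside [start, end], ->0 outside.
    # A shape like this encourages consecutive snippets to share one
    # action label (assumed parameterization, for illustration only).
    return (1.0 / (1.0 + np.exp(-sharpness * (t - start)))
            * 1.0 / (1.0 + np.exp(sharpness * (t - end))))

T, D = 8, 16  # number of snippets, feature dimension
rng = np.random.default_rng(0)
rgb_feats = rng.normal(size=(T, D))
flow_feats = rng.normal(size=(T, D))
rgb_refined, flow_refined = dual_stream_step(rgb_feats, flow_feats)
print(rgb_refined.shape, flow_refined.shape)  # (8, 16) (8, 16)
```

The cross-attention step is where the two modalities exchange complementary cues: motion-dominated flow features can sharpen ambiguous RGB snippets and vice versa, while the plateau window is a cheap way to encode that an action segment is a contiguous run of snippets rather than isolated peaks.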
Cite
Text
Hu et al. "Weakly-Supervised Temporal Action Localization with Multi-Modal Plateau Transformers." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00276
Markdown
[Hu et al. "Weakly-Supervised Temporal Action Localization with Multi-Modal Plateau Transformers." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/hu2024cvprw-weaklysupervised/) doi:10.1109/CVPRW63382.2024.00276
BibTeX
@inproceedings{hu2024cvprw-weaklysupervised,
title = {{Weakly-Supervised Temporal Action Localization with Multi-Modal Plateau Transformers}},
author = {Hu, Xin and Li, Kai and Patel, Deep and Kruus, Erik and Min, Martin Renqiang and Ding, Zhengming},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2024},
pages = {2704--2713},
doi = {10.1109/CVPRW63382.2024.00276},
url = {https://mlanthology.org/cvprw/2024/hu2024cvprw-weaklysupervised/}
}