Weakly-Supervised Temporal Action Localization with Multi-Modal Plateau Transformers

Abstract

Weakly-Supervised Temporal Action Localization (WS-TAL) aims to jointly localize and classify action segments in untrimmed videos with only video-level annotations. To leverage video-level annotations, most existing methods adopt the multiple instance learning paradigm, in which frame-/snippet-level action predictions are first produced and then aggregated to form a video-level prediction. Although attempts have been made to improve snippet-level predictions by modeling temporal relationships, we argue that these methods have not sufficiently exploited such information. In this paper, we propose Multi-Modal Plateau Transformers (M2PT) for WS-TAL, simultaneously exploiting temporal relationships among snippets, complementary information across data modalities, and temporal coherence among consecutive snippets. Specifically, M2PT explores a dual-Transformer architecture for the RGB and optical flow modalities, which models intra-modality temporal relationships with a self-attention mechanism and inter-modality temporal relationships with a cross-attention mechanism. To capture the temporal coherence whereby consecutive snippets should be assigned the same action, M2PT deploys a Plateau model to refine the temporal localization of action segments. Experimental results on popular benchmarks demonstrate that our proposed M2PT achieves state-of-the-art performance.
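The dual-stream self-/cross-attention pattern described in the abstract can be illustrated with a minimal NumPy sketch. This is an illustrative toy, not the paper's implementation: projections, multi-head attention, residual connections, positional encodings, and the Plateau refinement are all omitted, and all names and tensor sizes below are assumptions.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
T, d = 8, 16  # T snippets, d-dim features (toy sizes, chosen for illustration)
rgb = rng.standard_normal((T, d))    # RGB snippet features
flow = rng.standard_normal((T, d))   # optical-flow snippet features

# Intra-modality temporal relationships: each stream attends to itself.
rgb_self = attention(rgb, rgb, rgb)
flow_self = attention(flow, flow, flow)

# Inter-modality temporal relationships: each stream queries the other stream,
# letting RGB snippets gather complementary motion cues and vice versa.
rgb_cross = attention(rgb_self, flow_self, flow_self)
flow_cross = attention(flow_self, rgb_self, rgb_self)

print(rgb_cross.shape, flow_cross.shape)  # per-snippet fused features, one per modality
```

In the full model, the fused per-snippet features would feed the snippet-level classifier whose scores are aggregated into a video-level prediction under the multiple instance learning paradigm.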

Cite

Text

Hu et al. "Weakly-Supervised Temporal Action Localization with Multi-Modal Plateau Transformers." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00276

Markdown

[Hu et al. "Weakly-Supervised Temporal Action Localization with Multi-Modal Plateau Transformers." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/hu2024cvprw-weaklysupervised/) doi:10.1109/CVPRW63382.2024.00276

BibTeX

@inproceedings{hu2024cvprw-weaklysupervised,
  title     = {{Weakly-Supervised Temporal Action Localization with Multi-Modal Plateau Transformers}},
  author    = {Hu, Xin and Li, Kai and Patel, Deep and Kruus, Erik and Min, Martin Renqiang and Ding, Zhengming},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2024},
  pages     = {2704--2713},
  doi       = {10.1109/CVPRW63382.2024.00276},
  url       = {https://mlanthology.org/cvprw/2024/hu2024cvprw-weaklysupervised/}
}