Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization

Abstract

Temporally localizing actions in videos is one of the key components of video understanding. Learning from weakly-labeled data is seen as a potential solution for avoiding expensive frame-level annotations. Different from other works that depend only on the visual modality, we propose to learn richer audio-visual representations for weakly-supervised action localization. First, we propose a multi-stage cross-attention mechanism to collaboratively fuse audio and visual features while preserving intra-modal characteristics. Second, to model both foreground and background frames, we construct an open-max classifier that treats the background class as an open set. Third, for precise action localization, we design consistency losses that enforce temporal continuity of the action-class predictions and improve the reliability of foreground predictions. Extensive experiments on two publicly available video datasets (AVE and ActivityNet 1.2) show that the proposed method effectively fuses the audio and visual modalities and achieves state-of-the-art results for weakly-supervised action localization.
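To make the fusion idea concrete, the following is a minimal PyTorch sketch of cross-attention between per-snippet audio and visual features, where each modality attends to the other and the results are concatenated. It is an illustration of the general cross-attention mechanism only; the module names, feature dimensions, and single-stage design are assumptions for this sketch, not the authors' multi-stage architecture.

import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Attend from one modality's features (queries) to the other's (keys/values)."""
    def __init__(self, dim_q, dim_kv, dim_out):
        super().__init__()
        self.query = nn.Linear(dim_q, dim_out)
        self.key = nn.Linear(dim_kv, dim_out)
        self.value = nn.Linear(dim_kv, dim_out)

    def forward(self, x_q, x_kv):
        # x_q:  (batch, T, dim_q)   e.g. visual snippet features
        # x_kv: (batch, T, dim_kv)  e.g. audio snippet features
        q, k, v = self.query(x_q), self.key(x_kv), self.value(x_kv)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v  # (batch, T, dim_out): cross-modal info aligned to the queries

class AudioVisualFusion(nn.Module):
    """Fuse the two streams with cross-attention in both directions (hypothetical dims)."""
    def __init__(self, dim_v=1024, dim_a=128, dim_out=512):
        super().__init__()
        self.v_from_a = CrossModalAttention(dim_v, dim_a, dim_out)
        self.a_from_v = CrossModalAttention(dim_a, dim_v, dim_out)

    def forward(self, feat_v, feat_a):
        fused_v = self.v_from_a(feat_v, feat_a)  # visual queries, audio keys/values
        fused_a = self.a_from_v(feat_a, feat_v)  # audio queries, visual keys/values
        return torch.cat([fused_v, fused_a], dim=-1)  # (batch, T, 2*dim_out)

# Example: a batch of 2 videos, 16 snippets each, 1024-d visual and 128-d audio features.
fusion = AudioVisualFusion()
out = fusion(torch.randn(2, 16, 1024), torch.randn(2, 16, 128))
print(out.shape)  # torch.Size([2, 16, 1024])

In a weakly-supervised pipeline, fused snippet features like these would typically feed a snippet-level classifier whose scores are pooled into a video-level prediction; the open-max background modeling and consistency losses described in the abstract would be applied on top of such per-snippet scores.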

Cite

Text

Lee et al. "Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization." International Conference on Learning Representations, 2021.

Markdown

[Lee et al. "Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization." International Conference on Learning Representations, 2021.](https://mlanthology.org/iclr/2021/lee2021iclr-crossattentional/)

BibTeX

@inproceedings{lee2021iclr-crossattentional,
  title     = {{Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization}},
  author    = {Lee, Jun-Tae and Jain, Mihir and Park, Hyoungwoo and Yun, Sungrack},
  booktitle = {International Conference on Learning Representations},
  year      = {2021},
  url       = {https://mlanthology.org/iclr/2021/lee2021iclr-crossattentional/}
}