Weakly-Supervised Action Localization by Hierarchically-Structured Latent Attention Modeling

Abstract

Weakly-supervised action localization aims to recognize and localize action instances in untrimmed videos with only video-level labels. Most existing models rely on multiple instance learning (MIL), where the predictions of unlabeled instances are supervised by classifying labeled bags. MIL-based methods are relatively well studied and achieve strong performance on classification but not on localization. Generally, they locate temporal regions through video-level classification but overlook the temporal variations of feature semantics. To address this problem, we propose a novel attention-based hierarchically-structured latent model to learn the temporal variations of feature semantics. Specifically, our model entails two components: the first is an unsupervised change-point detection module that detects change-points by learning the latent representations of video features in a temporal hierarchy based on their rates of change, and the second is an attention-based classification model that selects the change-points of the foreground as the boundaries. To evaluate the effectiveness of our model, we conduct extensive experiments on two benchmark datasets, THUMOS-14 and ActivityNet-v1.3. The experiments show that our method outperforms current state-of-the-art methods and even achieves performance comparable to fully-supervised methods.
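To make the MIL setup mentioned in the abstract concrete, the following is a minimal sketch of how snippet-level (instance) scores can be supervised using only a video-level (bag) label. It assumes top-k mean pooling over time, a common choice in MIL-based localization in general; the abstract does not say which pooling the paper uses, and the function names `video_level_scores` and `mil_loss` and the parameter `k` are hypothetical, chosen only for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def video_level_scores(snippet_logits, k):
    """Aggregate per-snippet class logits into video-level probabilities.

    snippet_logits: list of T rows, each a list of C class logits.
    k: number of highest-scoring snippets averaged per class
       (top-k pooling is an assumption, not taken from the paper).
    """
    num_classes = len(snippet_logits[0])
    pooled = []
    for c in range(num_classes):
        column = sorted((row[c] for row in snippet_logits), reverse=True)
        top = column[:k]
        pooled.append(sum(top) / len(top))
    return softmax(pooled)


def mil_loss(snippet_logits, video_label, k):
    """Cross-entropy between pooled prediction and the video-level label."""
    probs = video_level_scores(snippet_logits, k)
    return -math.log(probs[video_label])


# Toy video: 4 snippets, 2 classes; snippets 1 and 2 respond strongly to class 0.
logits = [[0.0, 0.0], [5.0, 0.0], [6.0, 0.0], [0.0, 1.0]]
probs = video_level_scores(logits, k=2)
```

Because the loss only sees the pooled video-level score, nothing in this objective constrains where an action starts or ends, which illustrates the localization gap the abstract attributes to MIL-based methods.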

Cite

Text

Wang et al. "Weakly-Supervised Action Localization by Hierarchically-Structured Latent Attention Modeling." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00936

Markdown

[Wang et al. "Weakly-Supervised Action Localization by Hierarchically-Structured Latent Attention Modeling." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/wang2023iccv-weaklysupervised/) doi:10.1109/ICCV51070.2023.00936

BibTeX

@inproceedings{wang2023iccv-weaklysupervised,
  title     = {{Weakly-Supervised Action Localization by Hierarchically-Structured Latent Attention Modeling}},
  author    = {Wang, Guiqin and Zhao, Peng and Zhao, Cong and Yang, Shusen and Cheng, Jie and Leng, Luziwei and Liao, Jianxing and Guo, Qinghai},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {10203--10213},
  doi       = {10.1109/ICCV51070.2023.00936},
  url       = {https://mlanthology.org/iccv/2023/wang2023iccv-weaklysupervised/}
}