W-TALC: Weakly-Supervised Temporal Activity Localization and Classification
Abstract
Most activity localization methods in the literature require frame-wise annotations, which are burdensome to obtain. Learning from weak labels may be a potential solution towards reducing such manual labeling effort. Recent years have witnessed a substantial influx of tagged videos on the Internet, which can serve as a rich source of weakly-supervised training data. Specifically, the correlations between videos with similar tags can be utilized to temporally localize the activities. Towards this goal, we present W-TALC, a Weakly-supervised Temporal Activity Localization and Classification framework using only video-level labels. The proposed network can be divided into two sub-networks, namely the Two-Stream based feature extractor network and a weakly-supervised module, which we learn by optimizing two complementary loss functions. Qualitative and quantitative results on two challenging datasets, Thumos14 and ActivityNet1.2, demonstrate that the proposed method is able to detect activities at a fine granularity and achieve better performance than current state-of-the-art methods.
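The weakly-supervised module must bridge per-segment predictions and the video-level labels it is trained on. A common way to do this in weakly-supervised localization, and the aggregation used by W-TALC's multiple-instance-learning loss, is to pool the top-k temporal activations per class into a single video-level score. The sketch below (ours, not the authors' code; NumPy, with hypothetical names) illustrates that pooling step only:

```python
import numpy as np

def video_level_scores(segment_scores: np.ndarray, k: int) -> np.ndarray:
    """Aggregate per-segment class activations (T x C) into video-level
    class scores (C,) by averaging the k largest activations of each
    class along the temporal axis (k-max pooling).

    Illustrative sketch only; the function and parameter names are ours.
    """
    T, C = segment_scores.shape
    k = min(k, T)  # guard against videos shorter than k segments
    # sort along time and keep the k largest activations per class
    topk = np.sort(segment_scores, axis=0)[-k:, :]
    return topk.mean(axis=0)

# Toy example: 3 segments, 2 classes
scores = np.array([[0.0, 1.0],
                   [2.0, 3.0],
                   [4.0, 5.0]])
print(video_level_scores(scores, k=2))  # [3. 4.]
```

The resulting video-level scores can then be compared against the video's tag set with a standard classification loss, so no frame-wise annotation is ever needed.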
Cite
Text
Paul et al. "W-TALC: Weakly-Supervised Temporal Activity Localization and Classification." Proceedings of the European Conference on Computer Vision (ECCV), 2018. doi:10.1007/978-3-030-01225-0_35
Markdown
[Paul et al. "W-TALC: Weakly-Supervised Temporal Activity Localization and Classification." Proceedings of the European Conference on Computer Vision (ECCV), 2018.](https://mlanthology.org/eccv/2018/paul2018eccv-wtalc/) doi:10.1007/978-3-030-01225-0_35
BibTeX
@inproceedings{paul2018eccv-wtalc,
title = {{W-TALC: Weakly-Supervised Temporal Activity Localization and Classification}},
author = {Paul, Sujoy and Roy, Sourya and Roy-Chowdhury, Amit K.},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2018},
doi = {10.1007/978-3-030-01225-0_35},
url = {https://mlanthology.org/eccv/2018/paul2018eccv-wtalc/}
}