Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition
Abstract
Current fully-supervised video datasets consist of only a few hundred thousand videos and fewer than a thousand domain-specific labels. This hinders the progress towards advanced video architectures. This paper presents an in-depth study of using large volumes of web videos for pre-training video models for the task of action recognition. Our primary empirical finding is that pre-training at a very large scale (over 65 million videos), despite using noisy social-media videos and hashtags, substantially improves the state-of-the-art on three challenging public action recognition datasets. Further, we examine three questions in the construction of weakly-supervised video action datasets. First, given that actions involve interactions with objects, how should one construct a verb-object pre-training label space to benefit transfer learning the most? Second, frame-based models perform quite well on action recognition; is pre-training for good image features sufficient or is pre-training for spatio-temporal features valuable for optimal transfer learning? Finally, actions are generally less well-localized in long videos vs. short videos; since action labels are provided at a video level, how should one choose video clips for best performance, given some fixed budget of number or minutes of videos?
Cite
Text
Ghadiyaram et al. "Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019. doi:10.1109/CVPR.2019.01232
Markdown
[Ghadiyaram et al. "Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.](https://mlanthology.org/cvpr/2019/ghadiyaram2019cvpr-largescale/) doi:10.1109/CVPR.2019.01232
BibTeX
@inproceedings{ghadiyaram2019cvpr-largescale,
title = {{Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition}},
author = {Ghadiyaram, Deepti and Tran, Du and Mahajan, Dhruv},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2019},
doi = {10.1109/CVPR.2019.01232},
url = {https://mlanthology.org/cvpr/2019/ghadiyaram2019cvpr-largescale/}
}