Large-Scale Video Classification with Convolutional Neural Networks
Abstract
Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggest a multiresolution, foveated architecture as a promising way of speeding up the training. Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3% up from 43.9%).
Cite
Text
Karpathy et al. "Large-Scale Video Classification with Convolutional Neural Networks." Conference on Computer Vision and Pattern Recognition, 2014. doi:10.1109/CVPR.2014.223
Markdown
[Karpathy et al. "Large-Scale Video Classification with Convolutional Neural Networks." Conference on Computer Vision and Pattern Recognition, 2014.](https://mlanthology.org/cvpr/2014/karpathy2014cvpr-largescale/) doi:10.1109/CVPR.2014.223
BibTeX
@inproceedings{karpathy2014cvpr-largescale,
title = {{Large-Scale Video Classification with Convolutional Neural Networks}},
author = {Karpathy, Andrej and Toderici, George and Shetty, Sanketh and Leung, Thomas and Sukthankar, Rahul and Fei-Fei, Li},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2014},
doi = {10.1109/CVPR.2014.223},
url = {https://mlanthology.org/cvpr/2014/karpathy2014cvpr-largescale/}
}