NeXtVLAD: An Efficient Neural Network to Aggregate Frame-Level Features for Large-Scale Video Classification
Abstract
This paper introduces a fast and efficient network architecture, NeXtVLAD, to aggregate frame-level features into a compact feature vector for large-scale video classification. In brief, the basic idea is to decompose a high-dimensional feature into a group of relatively low-dimensional vectors with attention before applying NetVLAD aggregation over time. This NeXtVLAD approach turns out to be both effective and parameter-efficient in aggregating temporal information. In the 2nd Youtube-8M video understanding challenge, a single NeXtVLAD model with fewer than 80M parameters achieves a GAP score of 0.87846 on the private leaderboard. A mixture of 3 NeXtVLAD models reaches 0.88722, ranked 3rd among 394 teams. The code is publicly available at https://github.com/linrongc/youtube-8m.
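The aggregation idea sketched in the abstract (expand each frame feature, decompose it into low-dimensional groups, weight the groups with attention, then pool NetVLAD-style residuals over time) can be illustrated with a minimal NumPy sketch. All parameter names and shapes below (`W_expand`, `W_attn`, `W_assign`, `anchors`, `groups`) are illustrative assumptions for exposition, not the authors' exact implementation; see the linked repository for the real model.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nextvlad(X, W_expand, W_attn, W_assign, anchors, groups):
    """Sketch of NeXtVLAD-style aggregation (shapes are illustrative).

    X:        (M, N)  frame-level features for one video
    W_expand: (N, lam*N)        dimension-expansion weights
    W_attn:   (lam*N, groups)   per-group attention weights
    W_assign: (D, K)            cluster-assignment weights, D = lam*N // groups
    anchors:  (K, D)            learnable cluster centers
    Returns a (K*D,) L2-normalized video-level descriptor.
    """
    M, N = X.shape
    Xe = X @ W_expand                              # expand each frame feature
    D = Xe.shape[1] // groups
    Xg = Xe.reshape(M, groups, D)                  # decompose into low-dim group vectors
    attn = 1.0 / (1.0 + np.exp(-(Xe @ W_attn)))    # (M, G) sigmoid attention per group
    assign = softmax(Xg @ W_assign, axis=-1)       # (M, G, K) soft cluster assignment
    w = attn[..., None] * assign                   # combined weights, (M, G, K)
    # NetVLAD-style weighted residual aggregation over frames and groups
    V = np.einsum('mgk,mgd->kd', w, Xg) - w.sum(axis=(0, 1))[:, None] * anchors
    v = V.ravel()
    return v / (np.linalg.norm(v) + 1e-8)
```

Because each of the K clusters stores a residual of only D = lam*N/groups dimensions (rather than lam*N), the descriptor and the downstream classifier shrink by roughly a factor of `groups`, which is the source of the parameter efficiency claimed above.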
Cite
Text
Lin et al. "NeXtVLAD: An Efficient Neural Network to Aggregate Frame-Level Features for Large-Scale Video Classification." European Conference on Computer Vision Workshops, 2018. doi:10.1007/978-3-030-11018-5_19
Markdown
[Lin et al. "NeXtVLAD: An Efficient Neural Network to Aggregate Frame-Level Features for Large-Scale Video Classification." European Conference on Computer Vision Workshops, 2018.](https://mlanthology.org/eccvw/2018/lin2018eccvw-nextvlad/) doi:10.1007/978-3-030-11018-5_19
BibTeX
@inproceedings{lin2018eccvw-nextvlad,
title = {{NeXtVLAD: An Efficient Neural Network to Aggregate Frame-Level Features for Large-Scale Video Classification}},
author = {Lin, Rongcheng and Xiao, Jing and Fan, Jianping},
booktitle = {European Conference on Computer Vision Workshops},
year = {2018},
pages = {206--218},
doi = {10.1007/978-3-030-11018-5_19},
url = {https://mlanthology.org/eccvw/2018/lin2018eccvw-nextvlad/}
}