Searching for Two-Stream Models in Multivariate Space for Video Recognition
Abstract
Conventional video models rely on a single stream to capture complex spatial-temporal features. Recent work on two-stream video models, such as the SlowFast network and AssembleNet, prescribes separate streams to learn complementary features and achieves stronger performance. However, manually designing both streams as well as the fusion blocks between them is a daunting task that requires exploring a tremendously large design space. Such manual exploration is time-consuming and often ends up with sub-optimal architectures when computational resources are limited and the exploration is insufficient. In this work, we present a pragmatic neural architecture search approach that can efficiently search for two-stream video models in giant spaces. We design a multivariate search space, including 6 search variables, to capture a wide variety of choices in designing two-stream models. Furthermore, we propose a progressive search procedure that searches for the architecture of individual streams, fusion blocks, and attention blocks one after the other. We demonstrate that two-stream models with significantly better performance can be automatically discovered in our design space. Our searched two-stream models, namely Auto-TSNet, consistently outperform other models on standard benchmarks. On Kinetics, compared with the SlowFast model, our Auto-TSNet-L model reduces FLOPS by nearly 11 times while achieving the same accuracy of 78.9%. On Something-Something-V2, Auto-TSNet-M improves the accuracy by at least 2% over other methods that use less than 50 GFLOPS per video.
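As a rough illustration of why searching one component at a time is cheaper than searching all components jointly, the toy sketch below fixes the best option for each stage before moving to the next. It is not the authors' procedure: the search space, the option names, and `toy_score` (a stand-in for training and validating a candidate model) are all hypothetical, and the scores are made-up integers chosen to be additive so that the stage-wise search happens to find the joint optimum.

```python
from math import prod

# Hypothetical toy search space; stage and option names are illustrative only.
SEARCH_SPACE = {
    "streams": ["s0", "s1", "s2", "s3"],
    "fusion": ["none", "sum", "concat"],
    "attention": ["none", "channel", "spatial"],
}

# Starting configuration used for stages that have not been searched yet.
DEFAULTS = {"streams": "s0", "fusion": "none", "attention": "none"}

# Made-up integer scores (assumption); a real search would train and
# validate each candidate model instead of looking up a table.
TOY_TABLE = {
    "streams": {"s0": 0, "s1": 3, "s2": 5, "s3": 2},
    "fusion": {"none": 0, "sum": 1, "concat": 4},
    "attention": {"none": 0, "channel": 2, "spatial": 1},
}


def toy_score(config):
    """Stand-in for evaluating a candidate architecture (hypothetical)."""
    return sum(TOY_TABLE[stage][option] for stage, option in config.items())


def progressive_search(space, defaults, score):
    """Search stages one after the other, freezing each stage's best option."""
    config = dict(defaults)
    evaluations = 0
    for stage, options in space.items():
        best_option, best_score = None, None
        for option in options:
            candidate = {**config, stage: option}
            candidate_score = score(candidate)
            evaluations += 1
            if best_score is None or candidate_score > best_score:
                best_option, best_score = option, candidate_score
        config[stage] = best_option  # freeze this stage before moving on
    return config, evaluations


best_config, num_evaluations = progressive_search(SEARCH_SPACE, DEFAULTS, toy_score)
joint_space_size = prod(len(options) for options in SEARCH_SPACE.values())

print(best_config)
print(num_evaluations, "stage-wise evaluations vs", joint_space_size, "joint candidates")
```

In this toy setting the stage-wise search evaluates the sum of the per-stage option counts, while a joint search would have to consider their product. With a real, non-additive objective, a stage-wise search is not guaranteed to find the joint optimum.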
Cite
Text
Gong et al. "Searching for Two-Stream Models in Multivariate Space for Video Recognition." International Conference on Computer Vision, 2021. doi:10.1109/ICCV48922.2021.00793
Markdown
[Gong et al. "Searching for Two-Stream Models in Multivariate Space for Video Recognition." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/gong2021iccv-searching/) doi:10.1109/ICCV48922.2021.00793
BibTeX
@inproceedings{gong2021iccv-searching,
title = {{Searching for Two-Stream Models in Multivariate Space for Video Recognition}},
author = {Gong, Xinyu and Wang, Heng and Shou, Mike Zheng and Feiszli, Matt and Wang, Zhangyang and Yan, Zhicheng},
booktitle = {International Conference on Computer Vision},
year = {2021},
pages = {8033-8042},
doi = {10.1109/ICCV48922.2021.00793},
url = {https://mlanthology.org/iccv/2021/gong2021iccv-searching/}
}