Searching for Two-Stream Models in Multivariate Space for Video Recognition

Gong, Xinyu; Wang, Heng; Shou, Mike Zheng; Feiszli, Matt; Wang, Zhangyang; Yan, Zhicheng

doi:10.1109/ICCV48922.2021.00793

ICCV 2021 pp. 8033-8042

Abstract

Conventional video models rely on a single stream to capture complex spatial-temporal features. Recent work on two-stream video models, such as the SlowFast network and AssembleNet, prescribes separate streams to learn complementary features and achieves stronger performance. However, manually designing both streams as well as the in-between fusion blocks is a daunting task that requires exploring a tremendously large design space. Such manual exploration is time-consuming and often ends up with sub-optimal architectures when computational resources are limited and the exploration is insufficient. In this work, we present a pragmatic neural architecture search approach that can efficiently search for two-stream video models in giant spaces. We design a multivariate search space with 6 search variables to capture a wide variety of choices in designing two-stream models. Furthermore, we propose a progressive search procedure that searches for the architecture of individual streams, fusion blocks, and attention blocks one after the other. We demonstrate that two-stream models with significantly better performance can be automatically discovered in our design space. Our searched two-stream models, namely Auto-TSNet, consistently outperform other models on standard benchmarks. On Kinetics, compared with the SlowFast model, our Auto-TSNet-L model reduces FLOPs by nearly 11 times while achieving the same accuracy of 78.9%. On Something-Something-V2, Auto-TSNet-M improves the accuracy by at least 2% over other methods that use fewer than 50 GFLOPs per video.
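The stage-by-stage idea behind the progressive search can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the variable names (`depth`, `width`, `fusion_op`, `fusion_at`, `attention`), their candidate values, and the `toy_score` function are hypothetical and are not the paper's six search variables or its evaluation procedure. The sketch only shows why fixing the best choice at each stage before moving to the next evaluates far fewer candidates than enumerating the joint space.

```python
from itertools import product

# Hypothetical search variables grouped into three stages, mirroring the
# order named in the abstract (streams, then fusion, then attention).
# These names and values are illustrative, NOT the paper's search space.
STAGES = [
    ("streams", {"depth": [2, 3, 4], "width": [16, 32]}),
    ("fusion", {"fusion_op": ["sum", "concat"], "fusion_at": [1, 2, 3]}),
    ("attention", {"attention": ["none", "channel"]}),
]

# Starting configuration used for variables that have not been searched yet.
DEFAULTS = {"depth": 2, "width": 16, "fusion_op": "sum",
            "fusion_at": 1, "attention": "none"}


def toy_score(cfg):
    """Made-up stand-in for validation accuracy (assumption).

    A real search would train and evaluate each candidate model instead.
    """
    score = cfg["depth"] * 1.0 + cfg["width"] / 16.0
    score += 0.5 if cfg["fusion_op"] == "concat" else 0.0
    score -= 0.3 * abs(cfg["fusion_at"] - cfg["depth"] + 1)
    score += 0.2 if cfg["attention"] == "channel" else 0.0
    return score


def progressive_search(stages, defaults, score_fn):
    """Search one stage at a time, freezing the best choice before moving on."""
    best = dict(defaults)
    evaluated = 0
    for _stage_name, variables in stages:
        names = list(variables)
        best_score, best_candidate = None, None
        # Enumerate only this stage's variables; earlier stages stay frozen.
        for values in product(*(variables[n] for n in names)):
            candidate = {**best, **dict(zip(names, values))}
            evaluated += 1
            score = score_fn(candidate)
            if best_score is None or score > best_score:
                best_score, best_candidate = score, candidate
        best = best_candidate
    return best, evaluated


def joint_space_size(stages):
    """Number of candidates if all variables were searched jointly."""
    size = 1
    for _stage_name, variables in stages:
        for choices in variables.values():
            size *= len(choices)
    return size


best_cfg, num_evaluated = progressive_search(STAGES, DEFAULTS, toy_score)
print("best configuration:", best_cfg)
print("candidates evaluated:", num_evaluated)
print("joint space size:", joint_space_size(STAGES))
```

In this toy setup the staged search evaluates 14 candidates (6 + 6 + 2), whereas the joint space holds 72 (6 × 6 × 2). The trade-off is that a stage-wise search is greedy: a choice frozen in an early stage is never revisited, so it is not guaranteed to find the jointly best configuration.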

Cite

Text

Gong et al. "Searching for Two-Stream Models in Multivariate Space for Video Recognition." International Conference on Computer Vision, 2021. doi:10.1109/ICCV48922.2021.00793

Markdown

[Gong et al. "Searching for Two-Stream Models in Multivariate Space for Video Recognition." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/gong2021iccv-searching/) doi:10.1109/ICCV48922.2021.00793

BibTeX

@inproceedings{gong2021iccv-searching,
  title     = {{Searching for Two-Stream Models in Multivariate Space for Video Recognition}},
  author    = {Gong, Xinyu and Wang, Heng and Shou, Mike Zheng and Feiszli, Matt and Wang, Zhangyang and Yan, Zhicheng},
  booktitle = {International Conference on Computer Vision},
  year      = {2021},
  pages     = {8033--8042},
  doi       = {10.1109/ICCV48922.2021.00793},
  url       = {https://mlanthology.org/iccv/2021/gong2021iccv-searching/}
}