Efficient Video Transformers with Spatial-Temporal Token Selection

Abstract

Video transformers have achieved impressive results on major video recognition benchmarks; however, they suffer from high computational cost. In this paper, we present STTS, a token selection framework that dynamically selects a few informative tokens in both the temporal and spatial dimensions, conditioned on the input video. Specifically, we formulate token selection as a ranking problem: a lightweight scorer network estimates the importance of each token, and only the tokens with the top scores are used for downstream computation. In the temporal dimension, we keep the frames most relevant to the action categories, while in the spatial dimension, we identify the most discriminative regions in feature maps without disrupting the spatial context used hierarchically by most video transformers. Since token selection is non-differentiable, we employ a perturbed-maximum based differentiable Top-K operator for end-to-end training. We conduct extensive experiments on Kinetics-400 with a recently introduced video transformer backbone, MViT. Our framework achieves comparable accuracy while requiring 20% less computation. We also demonstrate that our approach generalizes to different transformer architectures and video datasets.
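The key trick the abstract mentions is replacing the hard, non-differentiable Top-K selection with a perturbed-maximum smoothing: average the hard Top-K indicator over Gaussian perturbations of the scores, so the expected selection mask varies smoothly with the scores. The sketch below illustrates only this general idea in NumPy; the noise scale `sigma`, sample count, and function names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def perturbed_topk(scores, k, sigma=0.05, n_samples=500, seed=0):
    """Smoothed Top-K mask via the perturbed-maximum trick (sketch).

    A hard Top-K indicator is piecewise constant in the scores, so its
    gradient is zero almost everywhere. Averaging hard Top-K indicators
    over Gaussian perturbations of the scores yields a mask whose
    expectation is smooth in the scores, enabling end-to-end training.
    sigma and n_samples here are illustrative choices, not the paper's.
    """
    rng = np.random.default_rng(seed)
    d = scores.shape[0]
    # Perturb the token scores with Gaussian noise, one row per sample.
    perturbed = scores[None, :] + sigma * rng.normal(size=(n_samples, d))
    # Hard Top-K indicator for each perturbed sample.
    topk_idx = np.argpartition(-perturbed, k, axis=1)[:, :k]
    indicators = np.zeros_like(perturbed)
    np.put_along_axis(indicators, topk_idx, 1.0, axis=1)
    # Monte Carlo estimate of the expected (soft) selection mask.
    return indicators.mean(axis=0)

# Hypothetical usage: 4 token scores, keep the top 2.
mask = perturbed_topk(np.array([2.0, -1.0, 1.5, -0.5]), k=2)
```

Each perturbed sample selects exactly `k` tokens, so the soft mask always sums to `k`; with a small `sigma`, clearly dominant tokens receive mask values near 1 and the rest near 0.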

Cite

Text

Wang et al. "Efficient Video Transformers with Spatial-Temporal Token Selection." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19833-5_5

Markdown

[Wang et al. "Efficient Video Transformers with Spatial-Temporal Token Selection." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/wang2022eccv-efficient-a/) doi:10.1007/978-3-031-19833-5_5

BibTeX

@inproceedings{wang2022eccv-efficient-a,
  title     = {{Efficient Video Transformers with Spatial-Temporal Token Selection}},
  author    = {Wang, Junke and Yang, Xitong and Li, Hengduo and Liu, Li and Wu, Zuxuan and Jiang, Yu-Gang},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-19833-5_5},
  url       = {https://mlanthology.org/eccv/2022/wang2022eccv-efficient-a/}
}