Streaming Video Model

Abstract

Video understanding tasks have traditionally been addressed by two separate architectures, each tailored to one of two distinct task types. Sequence-based video tasks, such as action recognition, use a video backbone to directly extract spatiotemporal features, while frame-based video tasks, such as multiple object tracking (MOT), rely on a single image backbone to extract spatial features. In contrast, we propose to unify video understanding tasks into one novel streaming video architecture, referred to as Streaming Vision Transformer (S-ViT). S-ViT first produces frame-level features with a memory-enabled, temporally-aware spatial encoder to serve the frame-based video tasks. The frame features are then fed into a task-related temporal decoder to obtain spatiotemporal features for sequence-based tasks. The efficiency and efficacy of S-ViT are demonstrated by state-of-the-art accuracy on the sequence-based action recognition task and a competitive advantage over conventional architectures on the frame-based MOT task. We believe that the concept of the streaming video model and the implementation of S-ViT are solid steps towards a unified deep learning architecture for video understanding. Code will be available at https://github.com/yuzhms/Streaming-Video-Model.
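The two-stage flow described in the abstract (per-frame spatial encoding with memory, followed by a temporal decoder) can be sketched in toy form. This is a hypothetical illustration only: the class and function names (`StreamingSpatialEncoder`, `temporal_decode`) and the simple exponential-style memory fusion are assumptions for exposition, not the paper's actual S-ViT implementation.

```python
# Hypothetical sketch of the streaming pipeline from the abstract.
# Stage 1: a spatial encoder processes frames one at a time, conditioning
# each frame on a running memory of past frames (stand-in for the
# memory-enabled, temporally-aware spatial encoder).
# Stage 2: a temporal decoder aggregates frame features into a clip-level
# feature for sequence-based tasks.

class StreamingSpatialEncoder:
    """Toy frame-by-frame encoder with a running memory of past features."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha    # weight of the memory when fusing a new frame
        self.memory = None    # running summary of previously seen frames

    def encode_frame(self, frame_feat):
        # frame_feat: list of floats, a stand-in for spatial features.
        if self.memory is None:
            fused = list(frame_feat)
        else:
            # Blend the new frame's features with the memory.
            fused = [(1 - self.alpha) * f + self.alpha * m
                     for f, m in zip(frame_feat, self.memory)]
        self.memory = fused   # update memory with the fused result
        return fused          # frame-level feature, usable for MOT-style tasks


def temporal_decode(frame_feats):
    """Toy temporal decoder: average frame features into one clip-level
    feature for sequence-based tasks such as action recognition."""
    n = len(frame_feats)
    return [sum(vals) / n for vals in zip(*frame_feats)]


# Stream three toy frames through the pipeline.
encoder = StreamingSpatialEncoder(alpha=0.5)
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
per_frame = [encoder.encode_frame(f) for f in frames]  # frame-based outputs
clip_feat = temporal_decode(per_frame)                 # sequence-based output
```

The key design point the abstract highlights is that the frame-level outputs are useful on their own (frame-based tasks) and also serve as the input to the temporal decoder (sequence-based tasks), so one backbone serves both task families.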

Cite

Text

Zhao et al. "Streaming Video Model." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01403

Markdown

[Zhao et al. "Streaming Video Model." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/zhao2023cvpr-streaming/) doi:10.1109/CVPR52729.2023.01403

BibTeX

@inproceedings{zhao2023cvpr-streaming,
  title     = {{Streaming Video Model}},
  author    = {Zhao, Yucheng and Luo, Chong and Tang, Chuanxin and Chen, Dongdong and Codella, Noel and Zha, Zheng-Jun},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {14602--14612},
  doi       = {10.1109/CVPR52729.2023.01403},
  url       = {https://mlanthology.org/cvpr/2023/zhao2023cvpr-streaming/}
}