Temporally Efficient Vision Transformer for Video Instance Segmentation

Abstract

Recently, vision transformers have achieved tremendous success on image-level visual recognition tasks. To effectively and efficiently model the crucial temporal information within a video clip, we propose a Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS). Unlike previous transformer-based VIS methods, TeViT is nearly convolution-free, consisting of a transformer backbone and a query-based video instance segmentation head. In the backbone stage, we propose a nearly parameter-free messenger shift mechanism for early temporal context fusion. In the head stages, we propose a parameter-shared spatiotemporal query interaction mechanism to build one-to-one correspondence between video instances and queries. Thus, TeViT fully utilizes both frame-level and instance-level temporal context information and obtains strong temporal modeling capacity with negligible extra computational cost. On three widely adopted VIS benchmarks, i.e., YouTube-VIS-2019, YouTube-VIS-2021, and OVIS, TeViT obtains state-of-the-art results while maintaining high inference speed, e.g., 46.6 AP at 68.9 FPS on YouTube-VIS-2019. Code is available at https://github.com/hustvl/TeViT.
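The messenger shift idea can be illustrated with a minimal, hedged sketch: each frame carries a small set of messenger tokens, and shifting rolls them one step along the temporal axis so each frame sees a neighboring frame's messengers in the next backbone layer. The function name and the plain-Python token representation below are hypothetical simplifications; the actual model shifts learned token embeddings inside the transformer backbone.

```python
def messenger_shift(msg_tokens_per_frame):
    """Circularly shift each frame's messenger tokens by one step in time,
    so frame t receives the messengers produced by frame t-1.

    A simplified, hypothetical sketch of early temporal context fusion;
    tokens are stand-in labels rather than real embeddings.
    """
    if len(msg_tokens_per_frame) <= 1:
        return list(msg_tokens_per_frame)
    # Roll by one frame: [m0, m1, m2] -> [m2, m0, m1]
    return msg_tokens_per_frame[-1:] + msg_tokens_per_frame[:-1]


# Three frames, each with two placeholder messenger tokens.
clip = [["m0a", "m0b"], ["m1a", "m1b"], ["m2a", "m2b"]]
shifted = messenger_shift(clip)
# Frame 1 now holds frame 0's messengers, frame 2 holds frame 1's, etc.
```

Because the shift only reorders tokens, it adds essentially no parameters or FLOPs, which matches the abstract's claim of negligible extra computational cost.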

Cite

Text

Yang et al. "Temporally Efficient Vision Transformer for Video Instance Segmentation." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.00290

Markdown

[Yang et al. "Temporally Efficient Vision Transformer for Video Instance Segmentation." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/yang2022cvpr-temporally/) doi:10.1109/CVPR52688.2022.00290

BibTeX

@inproceedings{yang2022cvpr-temporally,
  title     = {{Temporally Efficient Vision Transformer for Video Instance Segmentation}},
  author    = {Yang, Shusheng and Wang, Xinggang and Li, Yu and Fang, Yuxin and Fang, Jiemin and Liu, Wenyu and Zhao, Xun and Shan, Ying},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {2885--2895},
  doi       = {10.1109/CVPR52688.2022.00290},
  url       = {https://mlanthology.org/cvpr/2022/yang2022cvpr-temporally/}
}