Efficient Video Instance Segmentation via Tracklet Query and Proposal

Abstract

Video Instance Segmentation (VIS) aims to simultaneously classify, segment, and track multiple object instances in videos. Recent clip-level VIS methods, which take a short video clip as input at each step, show stronger performance than frame-level VIS (tracking-by-segmentation) because they exploit more temporal context from multiple frames. Yet most clip-level methods are neither end-to-end learnable nor real-time. These limitations are addressed by the recent VIS transformer (VisTR), which performs VIS end-to-end within a clip. However, VisTR suffers from long training time due to its frame-wise dense attention. In addition, VisTR is not fully end-to-end learnable across multiple video clips, as it requires hand-crafted data association to link instance tracklets between successive clips. This paper proposes EfficientVIS, a fully end-to-end framework with efficient training and inference. At its core are the tracklet query and tracklet proposal, which associate and segment regions-of-interest (RoIs) across space and time through iterative query-video interaction. We further propose a correspondence learning scheme that makes tracklet linking between clips end-to-end learnable. Compared to VisTR, EfficientVIS requires 15x fewer training epochs while achieving state-of-the-art accuracy on the YouTube-VIS benchmark. Meanwhile, our method enables whole-video instance segmentation in a single end-to-end pass without any data association.

Cite

Text

Wu et al. "Efficient Video Instance Segmentation via Tracklet Query and Proposal." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.00103

Markdown

[Wu et al. "Efficient Video Instance Segmentation via Tracklet Query and Proposal." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/wu2022cvpr-efficient/) doi:10.1109/CVPR52688.2022.00103

BibTeX

@inproceedings{wu2022cvpr-efficient,
  title     = {{Efficient Video Instance Segmentation via Tracklet Query and Proposal}},
  author    = {Wu, Jialian and Yarram, Sudhir and Liang, Hui and Lan, Tian and Yuan, Junsong and Eledath, Jayan and Medioni, Gérard},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {959--968},
  doi       = {10.1109/CVPR52688.2022.00103},
  url       = {https://mlanthology.org/cvpr/2022/wu2022cvpr-efficient/}
}