Towards Open-Vocabulary Video Instance Segmentation

Abstract

Video Instance Segmentation (VIS) aims at segmenting and categorizing objects in videos from a closed set of training categories, and thus lacks the generalization ability to handle novel categories in real-world videos. To address this limitation, we make the following three contributions. First, we introduce the novel task of Open-Vocabulary Video Instance Segmentation, which aims to simultaneously segment, track, and classify objects in videos from open-set categories, including novel categories unseen during training. Second, to benchmark Open-Vocabulary VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS), which contains well-annotated objects from 1,196 diverse categories, surpassing the category size of existing datasets by more than an order of magnitude. Third, we propose an efficient Memory-Induced Transformer architecture, OV2Seg, the first to achieve Open-Vocabulary VIS in an end-to-end manner with near real-time inference speed. Extensive experiments on LV-VIS and four existing VIS datasets demonstrate the strong zero-shot generalization ability of OV2Seg on novel categories. The dataset and code are released at https://github.com/haochenheheda/LVVIS.

Cite

Text

Wang et al. "Towards Open-Vocabulary Video Instance Segmentation." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00375

Markdown

[Wang et al. "Towards Open-Vocabulary Video Instance Segmentation." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/wang2023iccv-openvocabulary-a/) doi:10.1109/ICCV51070.2023.00375

BibTeX

@inproceedings{wang2023iccv-openvocabulary-a,
  title     = {{Towards Open-Vocabulary Video Instance Segmentation}},
  author    = {Wang, Haochen and Yan, Cilin and Wang, Shuai and Jiang, Xiaolong and Tang, Xu and Hu, Yao and Xie, Weidi and Gavves, Efstratios},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {4057--4066},
  doi       = {10.1109/ICCV51070.2023.00375},
  url       = {https://mlanthology.org/iccv/2023/wang2023iccv-openvocabulary-a/}
}