TubeFormer-DeepLab: Video Mask Transformer

Abstract

We present TubeFormer-DeepLab, the first attempt to tackle multiple core video segmentation tasks in a unified manner. Different video segmentation tasks (e.g., video semantic/instance/panoptic segmentation) are usually treated as distinct problems: state-of-the-art models adopted in the separate communities have diverged, and radically different approaches dominate each task. By contrast, we make the crucial observation that video segmentation tasks can generally be formulated as the problem of assigning predicted labels to video tubes (where a tube is obtained by linking segmentation masks along the time axis), and that the labels may encode different values depending on the target task. This observation motivates us to develop TubeFormer-DeepLab, a simple and effective video mask transformer model that is widely applicable to multiple video segmentation tasks. TubeFormer-DeepLab directly predicts video tubes with task-specific labels (either pure semantic categories, or both semantic categories and instance identities), which not only significantly simplifies video segmentation models but also advances state-of-the-art results on multiple video segmentation benchmarks.
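To make the tube formulation above concrete, the following is a minimal, hypothetical sketch (not the paper's code) of a unified tube-level output: each prediction pairs a mask linked across frames with a label that carries only a semantic category for video semantic segmentation, or both a semantic category and an instance identity for video instance/panoptic segmentation. All names, shapes, and the helper to_task_output are assumptions made for illustration.

from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class TubePrediction:
    # A video tube: one binary mask per frame, linked along the time axis.
    mask: np.ndarray            # shape (T, H, W), dtype bool
    semantic_class: int         # semantic category, used by every task
    instance_id: Optional[int]  # instance identity; None when only the category is needed

def to_task_output(tubes: List[TubePrediction], task: str):
    # Read the same set of tube predictions out as different video segmentation tasks.
    if task == "semantic":
        return [(t.mask, t.semantic_class) for t in tubes]
    if task in ("instance", "panoptic"):
        return [(t.mask, t.semantic_class, t.instance_id) for t in tubes]
    raise ValueError(f"unknown task: {task}")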

Cite

Text

Kim et al. "TubeFormer-DeepLab: Video Mask Transformer." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01354

Markdown

[Kim et al. "TubeFormer-DeepLab: Video Mask Transformer." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/kim2022cvpr-tubeformerdeeplab/) doi:10.1109/CVPR52688.2022.01354

BibTeX

@inproceedings{kim2022cvpr-tubeformerdeeplab,
  title     = {{TubeFormer-DeepLab: Video Mask Transformer}},
  author    = {Kim, Dahun and Xie, Jun and Wang, Huiyu and Qiao, Siyuan and Yu, Qihang and Kim, Hong-Seok and Adam, Hartwig and Kweon, In So and Chen, Liang-Chieh},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {13914--13924},
  doi       = {10.1109/CVPR52688.2022.01354},
  url       = {https://mlanthology.org/cvpr/2022/kim2022cvpr-tubeformerdeeplab/}
}