Spatial-Temporal Multi-Level Association for Video Object Segmentation

Abstract

Existing semi-supervised video object segmentation methods focus on either temporal feature matching or spatial-temporal feature modeling. However, they do not simultaneously address sufficient target interaction and efficient parallel processing, which constrains the learning of dynamic, target-aware features. To tackle these limitations, this paper proposes a spatial-temporal multi-level association framework that jointly associates reference frame, test frame, and object features to achieve sufficient interaction and parallel target ID association with a spatial-temporal memory bank for efficient video object segmentation. Specifically, we construct a spatial-temporal multi-level feature association module to learn better target-aware features; it formulates feature extraction and interaction as the efficient operations of object self-attention, reference object enhancement, and test reference correlation. In addition, we propose a spatial-temporal memory to assist feature association and temporal ID assignment and correlation. We evaluate the proposed method through extensive experiments on video object segmentation benchmarks, including DAVIS 2016/2017 val, DAVIS 2017 test-dev, and YouTube-VOS 2018/2019 val. Favorable performance against state-of-the-art methods demonstrates the effectiveness of our approach. All source code and trained models will be made publicly available.
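
The three operations named in the abstract (object self-attention, reference object enhancement, and test reference correlation) can be pictured as a chain of attention steps. The NumPy sketch below is only an illustration under stated assumptions, not the authors' implementation: the function names `attend` and `associate`, the residual connections, the use of plain scaled dot-product attention without learned projections, and the direction of each attention step are all hypothetical choices made for clarity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attend(query, key, value):
    """Scaled dot-product attention: query (n, d), key/value (m, d) -> (n, d)."""
    scores = query @ key.T / np.sqrt(query.shape[-1])
    return softmax(scores, axis=-1) @ value


def associate(obj_tokens, ref_feats, test_feats):
    """Hypothetical three-step association (an assumption, not the paper's code).

    obj_tokens: (num_objects, d) object features
    ref_feats:  (num_ref_pixels, d) flattened reference-frame features
    test_feats: (num_test_pixels, d) flattened test-frame features
    """
    # 1) Object self-attention: object tokens exchange information.
    obj = obj_tokens + attend(obj_tokens, obj_tokens, obj_tokens)
    # 2) Reference object enhancement: reference features attend to objects.
    ref = ref_feats + attend(ref_feats, obj, obj)
    # 3) Test reference correlation: test features attend to the enhanced reference.
    test = test_feats + attend(test_feats, ref, ref)
    return obj, ref, test


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    obj, ref, test = associate(
        rng.normal(size=(3, 8)),    # 3 objects, 8-dim features (toy sizes)
        rng.normal(size=(16, 8)),   # 16 reference-frame locations
        rng.normal(size=(16, 8)),   # 16 test-frame locations
    )
    print(obj.shape, ref.shape, test.shape)
```

In this sketch each step keeps the shape of its input, so the object, reference, and test features can be passed on to later stages unchanged in size; the paper's spatial-temporal memory bank and target ID assignment are not modeled here.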

Cite

Text

Miao et al. "Spatial-Temporal Multi-Level Association for Video Object Segmentation." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72855-6_6

Markdown

[Miao et al. "Spatial-Temporal Multi-Level Association for Video Object Segmentation." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/miao2024eccv-spatialtemporal/) doi:10.1007/978-3-031-72855-6_6

BibTeX

@inproceedings{miao2024eccv-spatialtemporal,
  title     = {{Spatial-Temporal Multi-Level Association for Video Object Segmentation}},
  author    = {Miao, Deshui and Li, Xin and He, Zhenyu and Lu, Huchuan and Yang, Ming-Hsuan},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72855-6_6},
  url       = {https://mlanthology.org/eccv/2024/miao2024eccv-spatialtemporal/}
}