Betrayed by Attention: A Simple yet Effective Approach for Self-Supervised Video Object Segmentation

Ding, Shuangrui; Qian, Rui; Xu, Haohang; Lin, Dahua; Xiong, Hongkai

doi:10.1007/978-3-031-72995-9_13

ECCV 2024

Abstract

In this paper, we propose a simple yet effective approach for self-supervised video object segmentation (VOS). Previous self-supervised VOS techniques mostly rely on auxiliary modalities or iterative slot attention to assist in object discovery, which restricts their general applicability. To address this limitation, we develop a simplified architecture that capitalizes on the emergent objectness of DINO-pretrained Transformers, bypassing the need for additional modalities or slot attention. Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal correspondences in videos. Furthermore, simple clustering on this correspondence cue is sufficient to yield competitive segmentation results. Specifically, we first introduce a single spatio-temporal Transformer block to process the frame-wise DINO features and establish spatio-temporal dependencies in the form of self-attention. Subsequently, using these attention maps, we apply hierarchical clustering to generate object segmentation masks. To train the spatio-temporal block in a fully self-supervised manner, we employ semantic and dynamic motion consistency coupled with entropy normalization. Our method demonstrates state-of-the-art performance across three multi-object video segmentation tasks. In particular, we achieve an improvement of over 5 points in FG-ARI on the complex real-world DAVIS-17-Unsupervised and YouTube-VIS-19 benchmarks compared to the previous best result. The code and checkpoint are released at https://github.com/shvdiwnkozbw/SSL-UVOS.
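The two-stage idea in the abstract (self-attention over spatio-temporal tokens, then hierarchical clustering of the attention maps) can be illustrated with a minimal sketch. This is not the authors' implementation, which is available in the repository linked above. It assumes plain dot-product attention without learned projections, cosine similarity between attention rows, and centroid-style agglomerative merging; the function names `self_attention` and `cluster_attention` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Attention map over all tokens (all patches of all frames).

    Assumption: queries and keys are the raw features; a real
    Transformer block would apply learned projections first.
    """
    scale = 1.0 / math.sqrt(len(features[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in features])
        for q in features
    ]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cluster_attention(attn, num_clusters):
    """Agglomerative clustering of tokens by similarity of attention rows.

    Each cluster is represented by the mean attention row of its members;
    the two most similar clusters are merged until num_clusters remain.
    Returns one integer label per token.
    """
    clusters = [[i] for i in range(len(attn))]
    width = len(attn[0])

    def mean_row(members):
        return [sum(attn[i][j] for i in members) / len(members) for j in range(width)]

    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = cosine(mean_row(clusters[a]), mean_row(clusters[b]))
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(attn)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy input: six tokens (e.g. 2 frames x 3 patches) from two "objects".
toy_features = [[4.0, 0.0], [4.2, 0.1], [0.0, 4.0], [0.1, 3.9], [3.9, 0.0], [0.0, 4.1]]
toy_labels = cluster_attention(self_attention(toy_features), num_clusters=2)
```

In this toy case, tokens 0, 1, and 4 share one label and tokens 2, 3, and 5 share the other, since tokens of the same object attend to the same set of tokens across frames. The pairwise search is cubic in the number of tokens, so it is only suitable for illustration.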

Cite

Text

Ding et al. "Betrayed by Attention: A Simple yet Effective Approach for Self-Supervised Video Object Segmentation." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72995-9_13

Markdown

[Ding et al. "Betrayed by Attention: A Simple yet Effective Approach for Self-Supervised Video Object Segmentation." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/ding2024eccv-betrayed/) doi:10.1007/978-3-031-72995-9_13

BibTeX

@inproceedings{ding2024eccv-betrayed,
  title     = {{Betrayed by Attention: A Simple yet Effective Approach for Self-Supervised Video Object Segmentation}},
  author    = {Ding, Shuangrui and Qian, Rui and Xu, Haohang and Lin, Dahua and Xiong, Hongkai},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72995-9_13},
  url       = {https://mlanthology.org/eccv/2024/ding2024eccv-betrayed/}
}