Betrayed by Attention: A Simple yet Effective Approach for Self-Supervised Video Object Segmentation
Abstract
In this paper, we propose a simple yet effective approach for self-supervised video object segmentation (VOS). Previous self-supervised VOS techniques mostly rely on auxiliary modalities or iterative slot attention to assist in object discovery, which restricts their general applicability. To address these limitations, we develop a simplified architecture that capitalizes on the emerging objectness from DINO-pretrained Transformers, bypassing the need for additional modalities or slot attention. Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal correspondences in videos. Furthermore, simple clustering on this correspondence cue is sufficient to yield competitive segmentation results. Specifically, we first introduce a single spatio-temporal Transformer block to process the frame-wise DINO features and establish spatio-temporal dependencies in the form of self-attention. Subsequently, utilizing these attention maps, we implement hierarchical clustering to generate object segmentation masks. To train the spatio-temporal block in a fully self-supervised manner, we employ semantic and dynamic motion consistency coupled with entropy normalization. Our method demonstrates state-of-the-art performance across three multi-object video segmentation tasks. Specifically, we achieve over 5 points of improvement in terms of FG-ARI on complex real-world DAVIS-17-Unsupervised and YouTube-VIS-19 compared to the previous best result. The code and checkpoint are released at https://github.com/shvdiwnkozbw/SSL-UVOS.
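The abstract reports gains in FG-ARI, the adjusted Rand index restricted to foreground pixels. As a rough illustration of what that metric measures, the sketch below computes it for two flat label sequences using only the Python standard library. The function name `fg_ari`, the choice of `0` as the background id, and the flat-list input are assumptions made for this example; the paper's evaluation protocol (for instance, how frames are aggregated) is not specified here and may differ.

```python
from collections import Counter
from math import comb


def fg_ari(gt, pred, background=0):
    """Adjusted Rand index over pixels whose ground-truth label is foreground.

    gt, pred: equal-length sequences of integer labels, one per pixel.
    background: ground-truth id treated as background (assumed to be 0 here).
    """
    # Keep only pixels that the ground truth marks as foreground.
    pairs = [(g, p) for g, p in zip(gt, pred) if g != background]
    n = len(pairs)
    if n < 2:
        return 1.0  # nothing to compare; treated as trivially consistent

    # Contingency counts and their marginals.
    joint = Counter(pairs)
    gt_counts = Counter(g for g, _ in pairs)
    pred_counts = Counter(p for _, p in pairs)

    # Numbers of pixel pairs grouped together jointly, in gt, and in pred.
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_gt = sum(comb(c, 2) for c in gt_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_gt * sum_pred / comb(n, 2)
    max_index = (sum_gt + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_joint - expected) / (max_index - expected)


if __name__ == "__main__":
    gt = [0, 0, 1, 1, 2, 2]
    pred = [7, 9, 7, 7, 9, 9]  # predictions on background pixels are ignored
    print(fg_ari(gt, pred))
```

The score is invariant to how predicted segments are numbered, which is why it suits object discovery without a fixed label set. Benchmarks commonly report it multiplied by 100, so an improvement of "5 points" would correspond to 0.05 on the scale returned here.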
Cite
Text
Ding et al. "Betrayed by Attention: A Simple yet Effective Approach for Self-Supervised Video Object Segmentation." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72995-9_13
Markdown
[Ding et al. "Betrayed by Attention: A Simple yet Effective Approach for Self-Supervised Video Object Segmentation." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/ding2024eccv-betrayed/) doi:10.1007/978-3-031-72995-9_13
BibTeX
@inproceedings{ding2024eccv-betrayed,
title = {{Betrayed by Attention: A Simple yet Effective Approach for Self-Supervised Video Object Segmentation}},
author = {Ding, Shuangrui and Qian, Rui and Xu, Haohang and Lin, Dahua and Xiong, Hongkai},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-72995-9_13},
url = {https://mlanthology.org/eccv/2024/ding2024eccv-betrayed/}
}