DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based Point-Level Consistency

Abstract

In this paper, we propose a simple yet effective transformer framework for self-supervised learning, called DenseDINO, to learn dense visual representations. To exploit the spatial information that dense prediction tasks require but existing self-supervised transformers neglect, we introduce point-level supervision across views in a novel token-based way. Specifically, DenseDINO introduces extra input tokens, called reference tokens, that match point-level features using a position prior. With the reference tokens, the model can maintain spatial consistency and handle complex scenes containing multiple objects, and thus generalizes better to dense prediction tasks. Compared with vanilla DINO, our approach obtains competitive performance on ImageNet classification and achieves a large improvement (+7.2% mIoU) in semantic segmentation on PascalVOC under the linear probing protocol for segmentation.
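The core idea in the abstract can be illustrated with a minimal numpy sketch: sample points in the overlap of two augmented views, encode each point's position in each view's coordinate frame as an extra "reference token" appended to the patch tokens, and pull the matched reference-token outputs of the two views together. Everything here (the sinusoidal position encoding, the toy linear "encoder", and the cosine consistency loss) is a placeholder assumption for illustration, not the paper's actual architecture or loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def pos_encoding(xy, dim=16):
    """Sinusoidal encoding of normalized (x, y) point coordinates."""
    freqs = 2.0 ** np.arange(dim // 4)
    ang = xy[:, :, None] * freqs * np.pi          # (n, 2, dim/4)
    enc = np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)
    return enc.reshape(len(xy), -1)               # (n, dim)

def sample_shared_points(crop_a, crop_b, n=4):
    """Sample points in the overlap of two crops (x0, y0, x1, y1 in [0, 1])."""
    x0, y0 = max(crop_a[0], crop_b[0]), max(crop_a[1], crop_b[1])
    x1, y1 = min(crop_a[2], crop_b[2]), min(crop_a[3], crop_b[3])
    return rng.uniform([x0, y0], [x1, y1], size=(n, 2))

def to_view_coords(pts, crop):
    """Express absolute image points in a crop's normalized coordinate frame."""
    x0, y0, x1, y1 = crop
    return (pts - [x0, y0]) / [x1 - x0, y1 - y0]

def forward(patch_tokens, ref_tokens, W):
    """Toy 'encoder': a shared linear map over patch + reference tokens."""
    tokens = np.concatenate([patch_tokens, ref_tokens], axis=0)
    out = tokens @ W
    return out[len(patch_tokens):]                # outputs at reference tokens

# Two overlapping views (crops) of the same image.
crop_a, crop_b = (0.0, 0.0, 0.7, 0.7), (0.3, 0.3, 1.0, 1.0)
pts = sample_shared_points(crop_a, crop_b)

# Reference tokens: positional encodings of the same points, per view frame.
ref_a = pos_encoding(to_view_coords(pts, crop_a))
ref_b = pos_encoding(to_view_coords(pts, crop_b))

dim = ref_a.shape[1]
W = rng.normal(size=(dim, dim))
patches = rng.normal(size=(9, dim))               # stand-in patch tokens

out_a = forward(patches, ref_a, W)
out_b = forward(patches, ref_b, W)

# Point-level consistency: pull matched reference-token outputs together.
cos = np.sum(out_a * out_b, axis=1) / (
    np.linalg.norm(out_a, axis=1) * np.linalg.norm(out_b, axis=1))
loss = float(np.mean(1.0 - cos))
```

In a real training loop this loss would be combined with the image-level DINO objective and backpropagated through the transformer; here the linear map merely stands in for the shared encoder so the token bookkeeping is visible.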

Cite

Text

Yuan et al. "DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based Point-Level Consistency." International Joint Conference on Artificial Intelligence, 2023. doi:10.24963/IJCAI.2023/188

Markdown

[Yuan et al. "DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based Point-Level Consistency." International Joint Conference on Artificial Intelligence, 2023.](https://mlanthology.org/ijcai/2023/yuan2023ijcai-densedino/) doi:10.24963/IJCAI.2023/188

BibTeX

@inproceedings{yuan2023ijcai-densedino,
  title     = {{DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based Point-Level Consistency}},
  author    = {Yuan, Yike and Fu, Xinghe and Yu, Yunlong and Li, Xi},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2023},
  pages     = {1695--1703},
  doi       = {10.24963/IJCAI.2023/188},
  url       = {https://mlanthology.org/ijcai/2023/yuan2023ijcai-densedino/}
}