On the Effectiveness of ViT Features as Local Semantic Descriptors

Abstract

We study the use of deep features extracted from a pre-trained Vision Transformer (ViT) as dense visual descriptors. We observe and empirically demonstrate that such features, when extracted from a self-supervised ViT model (DINO-ViT), exhibit several striking properties, including: (i) the features encode powerful, well-localized semantic information at high spatial granularity, such as object parts; (ii) the encoded semantic information is shared across related, yet different object categories; and (iii) positional bias changes gradually throughout the layers. These properties allow us to design simple methods for a variety of applications, including co-segmentation, part co-segmentation, and semantic correspondences. To distill the power of ViT features from convoluted design choices, we restrict ourselves to lightweight zero-shot methodologies (e.g., binning and clustering) applied directly to the features. Since our methods require no additional training or data, they are readily applicable across a variety of domains. We show by extensive qualitative and quantitative evaluation that our simple methodologies achieve competitive results with recent state-of-the-art supervised methods, and outperform previous unsupervised methods by a large margin. Code is available at https://dino-vit-features.github.io/.

Cite

Text

Amir et al. "On the Effectiveness of ViT Features as Local Semantic Descriptors." European Conference on Computer Vision Workshops, 2022. doi:10.1007/978-3-031-25069-9_3

Markdown

[Amir et al. "On the Effectiveness of ViT Features as Local Semantic Descriptors." European Conference on Computer Vision Workshops, 2022.](https://mlanthology.org/eccvw/2022/amir2022eccvw-effectiveness/) doi:10.1007/978-3-031-25069-9_3

BibTeX

@inproceedings{amir2022eccvw-effectiveness,
  title     = {{On the Effectiveness of ViT Features as Local Semantic Descriptors}},
  author    = {Amir, Shir and Gandelsman, Yossi and Bagon, Shai and Dekel, Tali},
  booktitle = {European Conference on Computer Vision Workshops},
  year      = {2022},
  pages     = {39--55},
  doi       = {10.1007/978-3-031-25069-9_3},
  url       = {https://mlanthology.org/eccvw/2022/amir2022eccvw-effectiveness/}
}