PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers

Abstract

Computer vision methods that explicitly detect object parts and reason on them are a step towards inherently interpretable models. Existing approaches that perform part discovery driven by a fine-grained classification task make very restrictive assumptions on the geometric properties of the discovered parts: they should be small and compact. Although this prior is useful in some cases, in this paper we show that pre-trained transformer-based vision models, such as self-supervised DINOv2 ViT, enable the relaxation of these constraints. In particular, we find that a total variation (TV) prior, which allows for multiple connected components of any size, substantially outperforms previous work. We test our approach on three fine-grained classification benchmarks: CUB, PartImageNet and Oxford Flowers, and compare our results to previously published methods as well as a re-implementation of the state-of-the-art method PDiscoNet with a transformer-based backbone. We consistently obtain substantial improvements across the board, both on part discovery metrics and the downstream classification task, showing that the strong inductive biases in self-supervised ViT models call for rethinking the geometric priors that can be used for unsupervised part discovery. Training code and pre-trained models are available at https://github.com/ananthu-aniraj/pdiscoformer.
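
To illustrate why a total variation prior tolerates multiple regions of any size, the following is a minimal sketch of anisotropic total variation computed on a single 2D part map. It is not the authors' implementation: the abstract does not give the exact form of the loss, and the function name `total_variation` and the toy maps are assumptions made purely for illustration.

```python
def total_variation(part_map):
    """Anisotropic total variation of a 2D map given as a list of rows.

    Sums the absolute differences between vertically and horizontally
    adjacent cells. Hypothetical helper, not taken from the paper's code.
    """
    height = len(part_map)
    width = len(part_map[0]) if height else 0
    tv = 0.0
    for i in range(height):
        for j in range(width):
            if i + 1 < height:  # difference with the cell below
                tv += abs(part_map[i + 1][j] - part_map[i][j])
            if j + 1 < width:  # difference with the cell to the right
                tv += abs(part_map[i][j + 1] - part_map[i][j])
    return tv


# One compact region in the middle of the map.
one_blob = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

# Two disconnected regions assigned to the same part.
two_regions = [
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
]

# A checkerboard: maximally fragmented assignment.
fragmented = [[(i + j) % 2 for j in range(4)] for i in range(4)]

for name, m in [("one_blob", one_blob),
                ("two_regions", two_regions),
                ("fragmented", fragmented)]:
    print(name, total_variation(m))
```

In this toy setting the two disconnected regions are penalized no more than the single compact region, while the fragmented map scores far higher. This matches the abstract's description of a prior that discourages noisy maps without forcing each part to be one small, compact component.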

Cite

Text

Aniraj et al. "PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73013-9_15

Markdown

[Aniraj et al. "PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/aniraj2024eccv-pdiscoformer/) doi:10.1007/978-3-031-73013-9_15

BibTeX

@inproceedings{aniraj2024eccv-pdiscoformer,
  title     = {{PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers}},
  author    = {Aniraj, Ananthu and Dantas, Cassio F. and Ienco, Dino and Marcos, Diego},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73013-9_15},
  url       = {https://mlanthology.org/eccv/2024/aniraj2024eccv-pdiscoformer/}
}