DINO as a Von Mises-Fisher Mixture Model

Abstract

Self-distillation methods using Siamese networks are popular for self-supervised pre-training. DINO is one such method, based on a cross-entropy loss between $K$-dimensional probability vectors obtained by applying a softmax function to the dot product between representations and learned prototypes. Since the learned representations are $L^2$-normalized, we show that DINO and its derivatives, such as iBOT, can be interpreted as a mixture model of von Mises-Fisher components. Under this interpretation, DINO assumes equal precision for all components when the prototypes are also $L^2$-normalized. Using this insight, we propose DINO-vMF, which adds appropriate normalization constants when computing the cluster assignment probabilities. Unlike DINO, DINO-vMF is stable even for the larger ViT-Base model with unnormalized prototypes. We show that the added flexibility of the mixture model is beneficial in terms of better image representations. The DINO-vMF pre-trained model consistently performs better than DINO on a range of downstream tasks. We obtain similar improvements for iBOT-vMF over iBOT, thereby showing that our proposed modification is also relevant for other methods derived from DINO.
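As a rough sketch of the interpretation described in the abstract (the notation $z$, $w_k$, $\tau$, $\mu_k$, $\kappa_k$, $C_d$ below is assumed here for illustration and not taken from the paper): let $z \in \mathbb{R}^d$ be an $L^2$-normalized representation, $w_k$ the prototype of component $k$, and $\tau$ the softmax temperature. The DINO assignment probabilities and a von Mises-Fisher component density can be written as

$$
P(k \mid z) = \frac{\exp\!\left(z^\top w_k / \tau\right)}{\sum_{k'=1}^{K} \exp\!\left(z^\top w_{k'} / \tau\right)},
\qquad
p(z \mid k) = C_d(\kappa_k)\, \exp\!\left(\kappa_k\, \mu_k^\top z\right),
$$

with direction $\mu_k = w_k / \lVert w_k \rVert$, precision $\kappa_k = \lVert w_k \rVert / \tau$, and vMF normalization constant $C_d(\kappa) = \kappa^{d/2-1} / \big((2\pi)^{d/2} I_{d/2-1}(\kappa)\big)$, where $I_\nu$ is the modified Bessel function of the first kind. When the prototypes are $L^2$-normalized, all $\kappa_k$ coincide and the $C_d(\kappa_k)$ factors cancel in the posterior of an equal-weight mixture, which is the sense in which plain DINO assumes equal precision. With unnormalized prototypes, each logit picks up an additional $\log C_d(\kappa_k)$ term, $P(k \mid z) \propto \exp\!\left(z^\top w_k / \tau + \log C_d(\kappa_k)\right)$, illustrating the kind of normalization-constant correction that DINO-vMF adds.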

Cite

Text

Govindarajan et al. "DINO as a Von Mises-Fisher Mixture Model." International Conference on Learning Representations, 2023.

Markdown

[Govindarajan et al. "DINO as a Von Mises-Fisher Mixture Model." International Conference on Learning Representations, 2023.](https://mlanthology.org/iclr/2023/govindarajan2023iclr-dino/)

BibTeX

@inproceedings{govindarajan2023iclr-dino,
  title     = {{DINO as a Von Mises-Fisher Mixture Model}},
  author    = {Govindarajan, Hariprasath and Sidén, Per and Roll, Jacob and Lindsten, Fredrik},
  booktitle = {International Conference on Learning Representations},
  year      = {2023},
  url       = {https://mlanthology.org/iclr/2023/govindarajan2023iclr-dino/}
}