The Hidden Pitfalls of the Cosine Similarity Loss

Abstract

We show that the gradient of the cosine similarity between two points goes to zero in two unexpected settings: (1) if a point has large magnitude or (2) if the points are on opposite ends of the latent space. Counterintuitively, we prove that optimizing the cosine similarity between points forces them to grow in magnitude. Thus, (1) is unavoidable in practice. We then observe that these derivations are extremely general -- they hold across deep learning architectures and for many of the standard self-supervised learning (SSL) loss functions. This leads us to propose cut-initialization: a simple change to network initialization that helps all studied SSL methods converge faster.
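Both vanishing-gradient cases in the abstract can be seen from the standard gradient of cosine similarity: for s(a, b) = a.b / (||a|| ||b||), the gradient with respect to a is (b/||b|| - s(a, b) * a/||a||) / ||a||, which decays like 1/||a|| and cancels exactly when a and b point in opposite directions. The NumPy sketch below (not part of the paper or this page; the helper name cosine_similarity_grad and the dimensions are illustrative assumptions) checks both cases numerically.

import numpy as np

def cosine_similarity_grad(a, b):
    # Gradient of cos(a, b) = a.b / (||a|| ||b||) with respect to a:
    #   grad = (b/||b|| - cos(a, b) * a/||a||) / ||a||,
    # which scales as 1/||a||.
    a_norm = np.linalg.norm(a)
    b_norm = np.linalg.norm(b)
    s = a @ b / (a_norm * b_norm)
    return (b / b_norm - s * a / a_norm) / a_norm

rng = np.random.default_rng(0)
a = rng.normal(size=128)
b = rng.normal(size=128)

# Setting (1): the gradient shrinks as the point's magnitude grows.
for scale in (1.0, 10.0, 100.0):
    g = cosine_similarity_grad(scale * a, b)
    print(f"||a|| scaled by {scale:>5}: ||grad|| = {np.linalg.norm(g):.6f}")

# Setting (2): points on opposite ends of the latent space (b = -a)
# give cos(a, b) = -1 and a gradient of exactly zero.
g_opposite = cosine_similarity_grad(a, -a)
print(f"antipodal points: ||grad|| = {np.linalg.norm(g_opposite):.2e}")

Under these assumptions, the printed gradient norms drop roughly by a factor of 10 for each tenfold increase in ||a||, and the antipodal case is zero up to floating-point error, matching the two settings described above.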

Cite

Text

Draganov et al. "The Hidden Pitfalls of the Cosine Similarity Loss." ICML 2024 Workshops: HiLD, 2024.

Markdown

[Draganov et al. "The Hidden Pitfalls of the Cosine Similarity Loss." ICML 2024 Workshops: HiLD, 2024.](https://mlanthology.org/icmlw/2024/draganov2024icmlw-hidden/)

BibTeX

@inproceedings{draganov2024icmlw-hidden,
  title     = {{The Hidden Pitfalls of the Cosine Similarity Loss}},
  author    = {Draganov, Andrew and Vadgama, Sharvaree and Bekkers, Erik J},
  booktitle = {ICML 2024 Workshops: HiLD},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/draganov2024icmlw-hidden/}
}