In Defense of Minhash over Simhash

Abstract

MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as common in practice such as search. The collision probability of MinHash is a function of resemblance similarity (R), while the collision probability of SimHash is a function of cosine similarity (S). To provide a common basis for comparison, we evaluate retrieval results in terms of S for both MinHash and SimHash. This evaluation is valid as we can prove that MinHash is a valid LSH with respect to S, by using a general inequality S 2

Cite

Text

Shrivastava and Li. "In Defense of Minhash over Simhash." International Conference on Artificial Intelligence and Statistics, 2014.

Markdown

[Shrivastava and Li. "In Defense of Minhash over Simhash." International Conference on Artificial Intelligence and Statistics, 2014.](https://mlanthology.org/aistats/2014/shrivastava2014aistats-defense/)

BibTeX

@inproceedings{shrivastava2014aistats-defense,
  title     = {{In Defense of Minhash over Simhash}},
  author    = {Shrivastava, Anshumali and Li, Ping},
  booktitle = {International Conference on Artificial Intelligence and Statistics},
  year      = {2014},
  pages     = {886-894},
  url       = {https://mlanthology.org/aistats/2014/shrivastava2014aistats-defense/}
}