In Defense of Minhash over Simhash
Abstract
MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as common in practice such as search. The collision probability of MinHash is a function of resemblance similarity (R), while the collision probability of SimHash is a function of cosine similarity (S). To provide a common basis for comparison, we evaluate retrieval results in terms of S for both MinHash and SimHash. This evaluation is valid as we can prove that MinHash is a valid LSH with respect to S, by using a general inequality S 2
Cite
Text
Shrivastava and Li. "In Defense of Minhash over Simhash." International Conference on Artificial Intelligence and Statistics, 2014.Markdown
[Shrivastava and Li. "In Defense of Minhash over Simhash." International Conference on Artificial Intelligence and Statistics, 2014.](https://mlanthology.org/aistats/2014/shrivastava2014aistats-defense/)BibTeX
@inproceedings{shrivastava2014aistats-defense,
title = {{In Defense of Minhash over Simhash}},
author = {Shrivastava, Anshumali and Li, Ping},
booktitle = {International Conference on Artificial Intelligence and Statistics},
year = {2014},
pages = {886-894},
url = {https://mlanthology.org/aistats/2014/shrivastava2014aistats-defense/}
}