One Permutation Hashing

Abstract

While minwise hashing is promising for large-scale learning in massive binary data, the preprocessing cost is prohibitive as it requires applying (e.g.) $k=500$ permutations to the data. The testing time is also expensive if a new data point (e.g., a new document or a new image) has not been processed. In this paper, we develop a simple **one permutation hashing** scheme to address this important issue. While it is true that the preprocessing step can be parallelized, this comes at the cost of additional hardware and implementation effort. Moreover, reducing $k$ permutations to just one is much more **energy-efficient**, an important consideration given that minwise hashing is widely deployed in the search industry. Beyond the theoretical probability analysis, our experiments on similarity estimation and on SVM & logistic regression also confirm the theoretical results.
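The core idea of the scheme can be illustrated with a minimal sketch (hypothetical helper names, not the authors' code): instead of applying $k$ independent permutations and storing $k$ minima, apply a single permutation to the $D$-dimensional binary vector, split the permuted index space into $k$ equal bins, and record the smallest permuted location within each bin, with a marker for empty bins.

```python
import random

def one_permutation_hash(nonzeros, D, k, seed=0):
    """One permutation hashing sketch (illustrative only).

    nonzeros: set of active indices of a binary vector of dimension D.
    Returns a list of k values; None marks an empty bin.
    Assumes k divides D for simplicity.
    """
    rng = random.Random(seed)
    perm = list(range(D))
    rng.shuffle(perm)                      # the single permutation
    bin_size = D // k
    sketch = [None] * k
    for i in nonzeros:
        p = perm[i]                        # permuted location of index i
        b = p // bin_size                  # which of the k bins it falls in
        offset = p % bin_size              # position within that bin
        if sketch[b] is None or offset < sketch[b]:
            sketch[b] = offset             # keep the bin's minimum
    return sketch
```

Two similar binary vectors, hashed under the same permutation, tend to agree in many bins, which is what makes the sketch usable for resemblance estimation; empty bins are the new complication that the paper's probability analysis addresses.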

Cite

Text

Li et al. "One Permutation Hashing." Neural Information Processing Systems, 2012.

Markdown

[Li et al. "One Permutation Hashing." Neural Information Processing Systems, 2012.](https://mlanthology.org/neurips/2012/li2012neurips-one/)

BibTeX

@inproceedings{li2012neurips-one,
  title     = {{One Permutation Hashing}},
  author    = {Li, Ping and Owen, Art and Zhang, Cun-hui},
  booktitle = {Neural Information Processing Systems},
  year      = {2012},
  pages     = {3113--3121},
  url       = {https://mlanthology.org/neurips/2012/li2012neurips-one/}
}