Re-Randomized Densification for One Permutation Hashing and Bin-Wise Consistent Weighted Sampling

Abstract

Jaccard similarity is widely used as a distance measure in many machine learning and search applications. Typically, hashing methods are essential for the use of Jaccard similarity to be practical in large-scale settings. For hashing binary (0/1) data, the idea of one permutation hashing (OPH) with densification significantly accelerates traditional minwise hashing algorithms while providing unbiased and accurate estimates. In this paper, we propose a strategy named “re-randomization” in the process of densification that could achieve the smallest variance among all densification schemes. The success of this idea naturally inspires us to generalize one permutation hashing to weighted (non-binary) data, which results in the socalled “bin-wise consistent weighted sampling (BCWS)” algorithm. We analyze the behavior of BCWS and compare it with a recent alternative. Extensive experiments on various datasets illustrates the effectiveness of our proposed methods.

Cite

Text

Li et al. "Re-Randomized Densification for One Permutation Hashing and Bin-Wise Consistent Weighted Sampling." Neural Information Processing Systems, 2019.

Markdown

[Li et al. "Re-Randomized Densification for One Permutation Hashing and Bin-Wise Consistent Weighted Sampling." Neural Information Processing Systems, 2019.](https://mlanthology.org/neurips/2019/li2019neurips-rerandomized/)

BibTeX

@inproceedings{li2019neurips-rerandomized,
  title     = {{Re-Randomized Densification for One Permutation Hashing and Bin-Wise Consistent Weighted Sampling}},
  author    = {Li, Ping and Li, Xiaoyun and Zhang, Cun-Hui},
  booktitle = {Neural Information Processing Systems},
  year      = {2019},
  pages     = {15926-15936},
  url       = {https://mlanthology.org/neurips/2019/li2019neurips-rerandomized/}
}