C-MinHash: Improving Minwise Hashing with Circulant Permutation

Abstract

Minwise hashing (MinHash) is an important and practical algorithm for generating random hashes to approximate the Jaccard (resemblance) similarity in massive binary (0/1) data. The basic theory of MinHash requires applying hundreds or even thousands of independent random permutations to each data vector in the dataset, in order to obtain reliable results for (e.g.,) building large-scale learning models or approximate near neighbor search. In this paper, we propose Circulant MinHash (C-MinHash) and provide the surprising theoretical results that using only two independent random permutations in a circulant manner leads to uniformly smaller Jaccard estimation variance than that of the classical MinHash with K independent permutations. Experiments are conducted to show the effectiveness of the proposed method. We also propose a more convenient C-MinHash variant which reduces two permutations to just one, with extensive numerical results to validate that it achieves essentially the same estimation accuracy as using two permutations.

Cite

Text

Li and Li. "C-MinHash: Improving Minwise Hashing with Circulant Permutation." International Conference on Machine Learning, 2022.

Markdown

[Li and Li. "C-MinHash: Improving Minwise Hashing with Circulant Permutation." International Conference on Machine Learning, 2022.](https://mlanthology.org/icml/2022/li2022icml-cminhash/)

BibTeX

@inproceedings{li2022icml-cminhash,
  title     = {{C-MinHash: Improving Minwise Hashing with Circulant Permutation}},
  author    = {Li, Xiaoyun and Li, Ping},
  booktitle = {International Conference on Machine Learning},
  year      = {2022},
  pages     = {12857-12887},
  volume    = {162},
  url       = {https://mlanthology.org/icml/2022/li2022icml-cminhash/}
}