Rejection Sampling for Weighted Jaccard Similarity Revisited

Abstract

Efficiently computing the weighted Jaccard similarity has become an active research topic in machine learning and theory. For sparse data, the standard technique is based on the consistent weighed sampling (CWS). For dense data, however, methods based on rejection sampling (RS) can be much more efficient. Nevertheless, existing RS methods are still slow for practical purposes. In this paper, we propose to improve RS by a strategy, which we call efficient rejection sampling (ERS), based on ``early stopping + densification''. We analyze the statistical property of ERS and provide experimental results to compare ERS with RS and other algorithms for hashing weighted Jaccard. The results demonstrate that ERS significantly improves the existing methods for estimating the weighted Jaccard similarity in relatively dense data.

Cite

Text

Li and Li. "Rejection Sampling for Weighted Jaccard Similarity Revisited." AAAI Conference on Artificial Intelligence, 2021. doi:10.1609/AAAI.V35I5.16543

Markdown

[Li and Li. "Rejection Sampling for Weighted Jaccard Similarity Revisited." AAAI Conference on Artificial Intelligence, 2021.](https://mlanthology.org/aaai/2021/li2021aaai-rejection/) doi:10.1609/AAAI.V35I5.16543

BibTeX

@inproceedings{li2021aaai-rejection,
  title     = {{Rejection Sampling for Weighted Jaccard Similarity Revisited}},
  author    = {Li, Xiaoyun and Li, Ping},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2021},
  pages     = {4197-4205},
  doi       = {10.1609/AAAI.V35I5.16543},
  url       = {https://mlanthology.org/aaai/2021/li2021aaai-rejection/}
}