Randomness Efficient Feature Hashing for Sparse Binary Data
Abstract
We present sketching algorithms for sparse binary datasets that keep the dataset binary after sketching, while simultaneously preserving multiple similarity measures, namely Jaccard similarity, cosine similarity, inner product, and Hamming distance, on the same sketch. A major advantage of our algorithms is that they are randomness efficient: they require significantly fewer random bits for sketching, logarithmic in the dimension, whereas competing algorithms require a number of random bits linear in the dimension. Our proposed algorithms are efficient, offer a compact sketch of the dataset, and can be deployed easily in a distributed setting. We present a theoretical analysis of our approach and complement it with extensive experiments on public datasets. For analysis purposes, our algorithms require a natural assumption on the dataset. We verify the assumption empirically and observe that it holds on several real-world datasets.
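To illustrate the randomness budget the abstract describes, the sketch below shows one common construction in this line of work: compressing a sparse binary vector by hashing its nonzero coordinates into buckets with a pairwise-independent hash and OR-aggregating the bits, so the sketch stays binary and the hash seed costs only O(log d) random bits. This is a minimal illustrative example under our own assumptions, not the paper's algorithm; all names and parameters here are ours.

    import random

    def pairwise_independent_hash(p, m, seed=None):
        # Draw h(x) = ((a*x + b) mod p) mod m, a pairwise-independent
        # hash family. Storing (a, b) takes O(log p) random bits,
        # i.e. logarithmic in the dimension d when p is a prime >= d.
        rng = random.Random(seed)
        a = rng.randrange(1, p)
        b = rng.randrange(0, p)
        return lambda x: ((a * x + b) % p) % m

    def sketch(nonzero_coords, h, m):
        # OR-aggregate the nonzero coordinates of a sparse binary
        # vector into an m-bit binary sketch.
        s = [0] * m
        for i in nonzero_coords:
            s[h(i)] = 1
        return s

    # Example: two sparse binary vectors over dimension d = 1000.
    d, m = 1000, 64
    p = 1009  # a prime >= d
    h = pairwise_independent_hash(p, m, seed=42)

    u = sketch({3, 17, 202, 511}, h, m)
    v = sketch({3, 17, 404, 511}, h, m)

    # Because the sketches are themselves binary, bitwise statistics
    # such as the inner product can be read off the sketches directly.
    print(sum(a & b for a, b in zip(u, v)))

Under a sparsity assumption, few nonzero coordinates collide in a bucket, which is the intuition behind estimating similarity measures directly from such binary sketches.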
Cite
Text
Pratap et al. "Randomness Efficient Feature Hashing for Sparse Binary Data." Proceedings of The 12th Asian Conference on Machine Learning, 2020.
Markdown
[Pratap et al. "Randomness Efficient Feature Hashing for Sparse Binary Data." Proceedings of The 12th Asian Conference on Machine Learning, 2020.](https://mlanthology.org/acml/2020/pratap2020acml-randomness/)
BibTeX
@inproceedings{pratap2020acml-randomness,
title = {{Randomness Efficient Feature Hashing for Sparse Binary Data}},
author = {Pratap, Rameshwar and Revanuru, Karthik and Ravi, Anirudh and Kulkarni, Raghav},
booktitle = {Proceedings of The 12th Asian Conference on Machine Learning},
year = {2020},
pages = {689--704},
volume = {129},
url = {https://mlanthology.org/acml/2020/pratap2020acml-randomness/}
}