Scaling up Simhash
Abstract
The seminal work of (Charikar, 2002) gives a space efficient sketching algorithm (Simhash) which compresses real-valued vectors to binary vectors while maintaining an estimate of the Cosine similarity between any pairs of original real-valued vectors. In this work, we propose a sketching algorithm – Simsketch – that can be applied on top of the results obtained from Simhash. This further reduces the data dimension while maintaining an estimate of the Cosine similarity between original real-valued vectors. As a consequence, it helps in scaling up the performance of Simhash. We present theoretical bounds of our result and complement it with experimentation on public datasets. Our proposed algorithm is simple, efficient, and therefore can be adopted in practice.
Cite
Text
Pratap et al. "Scaling up Simhash." Proceedings of The 12th Asian Conference on Machine Learning, 2020.Markdown
[Pratap et al. "Scaling up Simhash." Proceedings of The 12th Asian Conference on Machine Learning, 2020.](https://mlanthology.org/acml/2020/pratap2020acml-scaling/)BibTeX
@inproceedings{pratap2020acml-scaling,
title = {{Scaling up Simhash}},
author = {Pratap, Rameshwar and Deshmukh, Anup and Nair, Pratheeksha and Ravi, Anirudh},
booktitle = {Proceedings of The 12th Asian Conference on Machine Learning},
year = {2020},
pages = {705-720},
volume = {129},
url = {https://mlanthology.org/acml/2020/pratap2020acml-scaling/}
}