Hashing Algorithms for Large-Scale Learning

Abstract

Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory. We compare b-bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data.
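To make the scheme in the abstract concrete, the Python sketch below (not code from the paper; the function name `b_bit_minwise_features` and the use of random affine hash functions in place of true random permutations are assumptions for illustration) computes, for each of k hash functions, the lowest b bits of the minimum hashed value over a binary data point, and expands each result into a one-hot block of length 2^b, yielding a sparse binary vector that a linear SVM or logistic regression can train on.

```python
import numpy as np

def b_bit_minwise_features(sets, num_hashes=200, b=4, seed=0):
    """Turn binary data points (sets of nonzero feature indices) into
    one-hot-encoded b-bit minwise hashes that a linear SVM or logistic
    regression can consume directly.

    Minimal sketch: random affine hashes h(i) = (a*i + c) mod p stand in
    for true random permutations, a common practical shortcut.
    Assumes nonempty sets with feature indices below 2^31.
    """
    prime = (1 << 31) - 1                        # Mersenne prime; products stay within uint64
    rng = np.random.default_rng(seed)
    a = rng.integers(1, prime, size=num_hashes, dtype=np.uint64)
    c = rng.integers(0, prime, size=num_hashes, dtype=np.uint64)
    mask = (1 << b) - 1                          # keep only the lowest b bits of each minimum
    block = 1 << b                               # each hash expands to a 2^b one-hot block
    features = np.zeros((len(sets), num_hashes * block), dtype=np.float32)
    for row, s in enumerate(sets):
        idx = np.fromiter(s, dtype=np.uint64)    # nonzero indices of this binary data point
        for j in range(num_hashes):
            hashed = (a[j] * idx + c[j]) % prime
            low_bits = int(hashed.min()) & mask  # the b-bit minwise hash value
            features[row, j * block + low_bits] = 1.0
    return features

# Example: two toy "documents" represented as sets of feature indices.
X = b_bit_minwise_features([{1, 7, 42, 99}, {7, 42, 100, 512}], num_hashes=50, b=2)
print(X.shape)   # (2, 50 * 4); each row has exactly 50 nonzero entries
```

Each nonempty data point contributes exactly num_hashes nonzero entries, so the expanded k * 2^b-dimensional representation remains very sparse even for small b, which is what keeps the storage cost low for a given accuracy.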

Cite

Text

Li et al. "Hashing Algorithms for Large-Scale Learning." Neural Information Processing Systems, 2011.

Markdown

[Li et al. "Hashing Algorithms for Large-Scale Learning." Neural Information Processing Systems, 2011.](https://mlanthology.org/neurips/2011/li2011neurips-hashing/)

BibTeX

@inproceedings{li2011neurips-hashing,
  title     = {{Hashing Algorithms for Large-Scale Learning}},
  author    = {Li, Ping and Shrivastava, Anshumali and Moore, Joshua L. and König, Arnd C.},
  booktitle = {Neural Information Processing Systems},
  year      = {2011},
  pages     = {2672--2680},
  url       = {https://mlanthology.org/neurips/2011/li2011neurips-hashing/}
}