Hashing Algorithms for Large-Scale Learning

Abstract

Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory. We compare b-bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data.
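To make the scheme in the abstract concrete, the Python sketch below (not code from the paper; the function name `b_bit_minwise_features` and the use of random affine hash functions in place of true random permutations are assumptions for illustration) computes, for each of k hash functions, the lowest b bits of the minimum hashed value over a binary data point, and expands each result into a one-hot block of length 2^b, yielding a sparse binary vector that a linear SVM or logistic regression can train on.

```python
import numpy as np

def b_bit_minwise_features(sets, num_hashes=200, b=4, seed=0):
    """Turn binary data points (sets of nonzero feature indices) into
    one-hot-encoded b-bit minwise hashes that a linear SVM or logistic
    regression can consume directly.

    Minimal sketch: random affine hashes h(i) = (a*i + c) mod p stand in
    for true random permutations, a common practical shortcut.
    Assumes nonempty sets with feature indices below 2^31.
    """
    prime = (1 << 31) - 1                        # Mersenne prime; products stay within uint64
    rng = np.random.default_rng(seed)
    a = rng.integers(1, prime, size=num_hashes, dtype=np.uint64)
    c = rng.integers(0, prime, size=num_hashes, dtype=np.uint64)
    mask = (1 << b) - 1                          # keep only the lowest b bits of each minimum
    block = 1 << b                               # each hash expands to a 2^b one-hot block
    features = np.zeros((len(sets), num_hashes * block), dtype=np.float32)
    for row, s in enumerate(sets):
        idx = np.fromiter(s, dtype=np.uint64)    # nonzero indices of this binary data point
        for j in range(num_hashes):
            hashed = (a[j] * idx + c[j]) % prime
            low_bits = int(hashed.min()) & mask  # the b-bit minwise hash value
            features[row, j * block + low_bits] = 1.0
    return features

# Example: two toy "documents" represented as sets of feature indices.
X = b_bit_minwise_features([{1, 7, 42, 99}, {7, 42, 100, 512}], num_hashes=50, b=2)
print(X.shape)   # (2, 50 * 4); each row has exactly 50 nonzero entries
```

Each nonempty data point contributes exactly num_hashes nonzero entries, so the expanded k * 2^b-dimensional representation remains very sparse even for small b, which is what keeps the storage cost low for a given accuracy.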

Cite

Text

Li et al. "Hashing Algorithms for Large-Scale Learning." Neural Information Processing Systems, 2011.

Markdown

[Li et al. "Hashing Algorithms for Large-Scale Learning." Neural Information Processing Systems, 2011.](https://mlanthology.org/neurips/2011/li2011neurips-hashing/)

BibTeX

@inproceedings{li2011neurips-hashing,
  title     = {{Hashing Algorithms for Large-Scale Learning}},
  author    = {Li, Ping and Shrivastava, Anshumali and Moore, Joshua L. and König, Arnd C.},
  booktitle = {Neural Information Processing Systems},
  year      = {2011},
  pages     = {2672--2680},
  url       = {https://mlanthology.org/neurips/2011/li2011neurips-hashing/}
}