A Bayesian Nonparametric Approach to Count-Min Sketch Under Power-Law Data Streams

Abstract

The count-min sketch (CMS) is a randomized data structure that provides estimates of tokens’ frequencies in a large data stream using a compressed representation of the data by random hashing. In this paper, we rely on a recent Bayesian nonparametric (BNP) view on the CMS to develop a novel learning-augmented CMS under power-law data streams. We assume that tokens in the stream are drawn from an unknown discrete distribution, which is endowed with a normalized inverse Gaussian process (NIGP) prior. Then, using distributional properties of the NIGP, we compute the posterior distribution of a token’s frequency in the stream, given the hashed data, and in turn corresponding BNP estimates. Applications to synthetic and real data show that our approach achieves a remarkable performance in the estimation of low-frequency tokens. This is known to be a desirable feature in the context of natural language processing, where it is indeed common in the context of the power-law behaviour of the data.

Cite

Text

Dolera et al. "A Bayesian Nonparametric Approach to Count-Min Sketch Under Power-Law Data Streams." Artificial Intelligence and Statistics, 2021.

Markdown

[Dolera et al. "A Bayesian Nonparametric Approach to Count-Min Sketch Under Power-Law Data Streams." Artificial Intelligence and Statistics, 2021.](https://mlanthology.org/aistats/2021/dolera2021aistats-bayesian/)

BibTeX

@inproceedings{dolera2021aistats-bayesian,
  title     = {{A Bayesian Nonparametric Approach to Count-Min Sketch Under Power-Law Data Streams}},
  author    = {Dolera, Emanuele and Favaro, Stefano and Peluchetti, Stefano},
  booktitle = {Artificial Intelligence and Statistics},
  year      = {2021},
  pages     = {226-234},
  volume    = {130},
  url       = {https://mlanthology.org/aistats/2021/dolera2021aistats-bayesian/}
}