Learning-Based Support Estimation in Sublinear Time
Abstract
We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to $ \pm \varepsilon n$ from a sample of size $O(\log^2(1/\varepsilon) \cdot n/\log n)$, where $n$ is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimation of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to $ \ \log (1/\varepsilon) \cdot n^{1-\Theta(1/\log(1/\varepsilon))}. $ We evaluate the proposed algorithms on a collection of data sets, using the neural-network based estimators from Hsu et al, ICLR'19 as predictors. Our experiments demonstrate substantial (up to 3x) improvements in the estimation accuracy compared to the state of the art algorithm.
Cite
Text
Eden et al. "Learning-Based Support Estimation in Sublinear Time." International Conference on Learning Representations, 2021.Markdown
[Eden et al. "Learning-Based Support Estimation in Sublinear Time." International Conference on Learning Representations, 2021.](https://mlanthology.org/iclr/2021/eden2021iclr-learningbased/)BibTeX
@inproceedings{eden2021iclr-learningbased,
title = {{Learning-Based Support Estimation in Sublinear Time}},
author = {Eden, Talya and Indyk, Piotr and Narayanan, Shyam and Rubinfeld, Ronitt and Silwal, Sandeep and Wagner, Tal},
booktitle = {International Conference on Learning Representations},
year = {2021},
url = {https://mlanthology.org/iclr/2021/eden2021iclr-learningbased/}
}