Sample Complexity of Population Recovery
Abstract
The problem of population recovery refers to estimating a distribution based on incomplete or corrupted samples. Consider a random poll of sample size $n$ conducted on a population of individuals, where each pollee is asked to answer $d$ binary questions. We consider one of the two polling impediments: (i) in lossy population recovery, a pollee may skip each question with probability $ε$; (ii) in noisy population recovery, a pollee may lie on each question with probability $ε$. Given $n$ lossy or noisy samples, the goal is to estimate the probabilities of all $2^d$ binary vectors simultaneously within accuracy $δ$ with high probability. This paper settles the sample complexity of population recovery. For the lossy model, the optimal sample complexity is $\tilde{Θ}(δ^{-2\max\{\frac{ε}{1-ε},1\}})$, improving the state of the art by Moitra and Saks in several ways: a lower bound is established, the upper bound is improved, and the result is dimension-free. Surprisingly, the sample complexity undergoes a phase transition from parametric to nonparametric rate when $ε$ exceeds $1/2$. For noisy population recovery, the sharp sample complexity turns out to be dimension-dependent and scales as $\exp(Θ(d^{1/3} \log^{2/3}(1/δ)))$ except for the trivial cases of $ε=0$, $1/2$, or $1$. For both models, our estimators simply compute the empirical mean of a certain function, which is found by pre-solving a linear program (LP). Curiously, the dual LP can be understood as Le Cam’s method for lower-bounding the minimax risk, thus establishing the statistical optimality of the proposed estimators. The value of the LP is determined by complex-analytic methods.
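To make the two observation models concrete, the following is a minimal simulation sketch, not the paper's estimator. It only generates lossy and noisy samples from a small population; the function names (`lossy_sample`, `noisy_sample`, `draw_population`), the use of `None` to mark a skipped question, and the toy distribution are all assumptions made for illustration.

```python
import random

def draw_population(support, probs, n, rng):
    """Draw n clean binary vectors from the population distribution."""
    return rng.choices(support, weights=probs, k=n)

def lossy_sample(x, eps, rng):
    """Lossy model: each answer is independently skipped with probability eps.
    A skipped answer is represented by None (an illustrative convention)."""
    return [None if rng.random() < eps else b for b in x]

def noisy_sample(x, eps, rng):
    """Noisy model: each answer is independently flipped with probability eps."""
    return [b ^ 1 if rng.random() < eps else b for b in x]

if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical population over d = 3 binary questions.
    support = [(0, 0, 0), (1, 0, 1), (1, 1, 1)]
    probs = [0.5, 0.3, 0.2]
    clean = draw_population(support, probs, 5, rng)
    for x in clean:
        print(x, lossy_sample(x, 0.3, rng), noisy_sample(x, 0.3, rng))
```

The recovery task described in the abstract starts from only the lossy or only the noisy column of such output and asks for estimates of `probs` to within $δ$.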
Cite
Text
Polyanskiy et al. "Sample Complexity of Population Recovery." Proceedings of the 2017 Conference on Learning Theory, 2017.
Markdown
[Polyanskiy et al. "Sample Complexity of Population Recovery." Proceedings of the 2017 Conference on Learning Theory, 2017.](https://mlanthology.org/colt/2017/polyanskiy2017colt-sample/)
BibTeX
@inproceedings{polyanskiy2017colt-sample,
title = {{Sample Complexity of Population Recovery}},
author = {Polyanskiy, Yury and Suresh, Ananda Theertha and Wu, Yihong},
booktitle = {Proceedings of the 2017 Conference on Learning Theory},
year = {2017},
pages = {1589-1618},
volume = {65},
url = {https://mlanthology.org/colt/2017/polyanskiy2017colt-sample/}
}