Data Amplification: Instance-Optimal Property Estimation

Abstract

The best-known and most commonly used technique for distribution-property estimation uses a plug-in estimator, with empirical frequency replacing the underlying distribution. We present novel linear-time-computable estimators that significantly “amplify” the effective amount of data available. For a large variety of distribution properties including four of the most popular ones and for every underlying distribution, they achieve the accuracy that the empirical-frequency plug-in estimators would attain using a logarithmic-factor more samples. Specifically, for Shannon entropy and a broad class of Lipschitz properties including the $L_1$ distance to a fixed distribution, the new estimators use $n$ samples to achieve the accuracy attained by the empirical estimators with $n\log n$ samples. For support-size and coverage, the new estimators use $n$ samples to achieve the performance of empirical frequency with sample size $n$ times the logarithm of the property value. Significantly strengthening the traditional min-max formulation, these results hold not only for the worst distributions, but for each and every underlying distribution. Furthermore, the logarithmic amplification factors are optimal. Experiments on a wide variety of distributions show that the new estimators outperform the previous state-of-the-art estimators designed for each specific property.

Cite

Text

Hao and Orlitsky. "Data Amplification: Instance-Optimal Property Estimation." International Conference on Machine Learning, 2020.

Markdown

[Hao and Orlitsky. "Data Amplification: Instance-Optimal Property Estimation." International Conference on Machine Learning, 2020.](https://mlanthology.org/icml/2020/hao2020icml-data/)

BibTeX

@inproceedings{hao2020icml-data,
  title     = {{Data Amplification: Instance-Optimal Property Estimation}},
  author    = {Hao, Yi and Orlitsky, Alon},
  booktitle = {International Conference on Machine Learning},
  year      = {2020},
  pages     = {4049-4059},
  volume    = {119},
  url       = {https://mlanthology.org/icml/2020/hao2020icml-data/}
}