Nearly-Tight Bounds for Testing Histogram Distributions

Abstract

We investigate the problem of testing whether a discrete probability distribution over an ordered domain is a histogram on a specified number of bins. One of the most common tools for the succinct approximation of data, a $k$-histogram over $[n]$ is a probability distribution that is piecewise constant over a set of $k$ intervals. Given samples from an unknown distribution $\mathbf p$ on $[n]$, we want to distinguish between the cases that $\mathbf p$ is a $k$-histogram and that $\mathbf p$ is far, in total variation distance, from every $k$-histogram. Our main result is a sample near-optimal and computationally efficient algorithm for this testing problem, together with a nearly matching (within logarithmic factors) sample-complexity lower bound, showing that the testing problem has sample complexity $\widetilde \Theta (\sqrt{nk} / \epsilon + k / \epsilon^2 + \sqrt{n} / \epsilon^2)$.
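
For intuition about the objects in the abstract, here is a minimal Python sketch (illustrative only; it is not the paper's testing algorithm). It builds the "flattening" of a distribution over a fixed partition into intervals, which is itself a histogram on those intervals, computes total variation distance, and evaluates the sample-complexity bound above for concrete parameters. The function names and example values are our own illustrative choices.

```python
import numpy as np

def flatten(p, intervals):
    """Flattening of p over a fixed partition: on each interval,
    spread the interval's total mass uniformly. The result is a
    k-histogram (piecewise constant on the k given intervals)."""
    q = np.empty_like(p, dtype=float)
    for (lo, hi) in intervals:  # hi is exclusive
        q[lo:hi] = p[lo:hi].sum() / (hi - lo)
    return q

def tv_distance(p, q):
    """Total variation distance: half the L1 distance."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def sample_complexity(n, k, eps):
    """The Theta-tilde bound from the abstract, ignoring constants
    and logarithmic factors."""
    return np.sqrt(n * k) / eps + k / eps**2 + np.sqrt(n) / eps**2

if __name__ == "__main__":
    # A 2-histogram on [4]: constant on {0,1} and on {2,3}.
    p = np.array([0.4, 0.4, 0.1, 0.1])
    q = flatten(p, [(0, 2), (2, 4)])
    print(tv_distance(p, q))                 # 0.0: p is already a 2-histogram
    print(sample_complexity(1000, 10, 0.1))  # ~ 5.2e3, up to log factors
```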

Cite

Text

Canonne et al. "Nearly-Tight Bounds for Testing Histogram Distributions." Neural Information Processing Systems, 2022.

Markdown

[Canonne et al. "Nearly-Tight Bounds for Testing Histogram Distributions." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/canonne2022neurips-nearlytight/)

BibTeX

@inproceedings{canonne2022neurips-nearlytight,
  title     = {{Nearly-Tight Bounds for Testing Histogram Distributions}},
  author    = {Canonne, Clément L. and Diakonikolas, Ilias and Kane, Daniel and Liu, Sihan},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/canonne2022neurips-nearlytight/}
}