Super-Sampling with a Reservoir

Abstract

We introduce an alternative to reservoir sampling, a classic and popular algorithm for drawing a fixed-size subsample from streaming data in a single pass. Rather than draw a random sample, our approach performs an online optimization which aims to select the subset that provides the best overall approximation to the full data set, as judged using a kernel two-sample test. This produces subsets which minimize the worst-case relative error when computing expectations of functions in a specified function class, using just the samples from the subset. Kernel functions are approximated using random Fourier features, and the subset of samples itself is stored in a random projection tree. The resulting algorithm runs in a single pass through the whole data set, and has a per-iteration computational complexity logarithmic in the size of the subset. These "super-samples" subsampled from the full data provide a concise summary, as demonstrated empirically on mixture models and the MNIST dataset.
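The core idea in the abstract can be sketched in a few lines: approximate the kernel two-sample discrepancy (MMD) with random Fourier features, then greedily swap incoming stream points into the reservoir whenever the swap moves the reservoir's mean feature vector closer to the running mean feature vector of the full stream. The sketch below is a simplified illustration, not the paper's algorithm: it uses a linear scan over the reservoir where the paper uses a random projection tree for logarithmic-time search, and it assumes an RBF kernel with unit bandwidth; the function names (`rff`, `super_sample`) are hypothetical.

```python
import numpy as np

def rff(x, W, b):
    # Random Fourier features approximating an RBF kernel (Rahimi & Recht):
    # k(x, y) ~= rff(x) . rff(y).
    return np.sqrt(2.0 / W.shape[1]) * np.cos(x @ W + b)

def super_sample(stream, k, D=64, seed=0):
    """Greedy MMD-minimizing subsample of `stream` (shape (n, d)) of size k.

    Simplified sketch: linear scan per step (the paper uses a random
    projection tree to make each step logarithmic in k).
    """
    rng = np.random.default_rng(seed)
    d = stream.shape[1]
    W = rng.normal(size=(d, D))            # assumes unit-bandwidth RBF kernel
    b = rng.uniform(0.0, 2.0 * np.pi, D)

    reservoir = stream[:k].copy()
    feats = rff(reservoir, W, b)           # (k, D) features of reservoir points
    mean_stream = feats.mean(axis=0)       # running mean feature of full stream
    n = k
    for x in stream[k:]:
        n += 1
        fx = rff(x[None, :], W, b)[0]
        mean_stream += (fx - mean_stream) / n   # online update of stream mean

        # Squared feature-space distance ~ MMD^2 between reservoir and stream.
        mean_res = feats.mean(axis=0)
        base = np.sum((mean_res - mean_stream) ** 2)

        # Effect on the reservoir mean of swapping each point j for x.
        cand = mean_res + (fx - feats) / k      # (k, D) candidate means
        errs = np.sum((cand - mean_stream) ** 2, axis=1)
        j = np.argmin(errs)
        if errs[j] < base:                      # keep the swap only if it helps
            reservoir[j] = x
            feats[j] = fx
    return reservoir
```

Usage: `super_sample(data, 20)` returns 20 rows drawn from `data` that greedily minimize the approximate MMD to the whole stream, in one pass.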

Cite

Text

Paige et al. "Super-Sampling with a Reservoir." Conference on Uncertainty in Artificial Intelligence, 2016.

Markdown

[Paige et al. "Super-Sampling with a Reservoir." Conference on Uncertainty in Artificial Intelligence, 2016.](https://mlanthology.org/uai/2016/paige2016uai-super/)

BibTeX

@inproceedings{paige2016uai-super,
  title     = {{Super-Sampling with a Reservoir}},
  author    = {Paige, Brooks and Sejdinovic, Dino and Wood, Frank D.},
  booktitle = {Conference on Uncertainty in Artificial Intelligence},
  year      = {2016},
  url       = {https://mlanthology.org/uai/2016/paige2016uai-super/}
}