Stratified Sampling Meets Machine Learning

Abstract

This paper solves a specialized regression problem to obtain sampling probabilities for records in databases. The goal is to sample a small set of records over which aggregate queries can be evaluated both efficiently and accurately. We provide a principled and provable solution for this problem; it is parameterless and requires no data insights. Unlike standard regression problems, the loss is inversely proportional to the regressed-to values. Moreover, a zero-cost solution always exists and can be excluded only by hard budget constraints. A unique form of regularization is also needed. We provide an efficient and simple regularized Empirical Risk Minimization (ERM) algorithm along with a theoretical generalization result. Our extensive experiments show significant improvements over both uniform sampling and standard stratified sampling, which are the de facto industry standards.
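
The setting the abstract describes matches classical Horvitz-Thompson estimation: if record i with value x_i is included in the sample independently with probability p_i, then sum over sampled i of x_i / p_i is an unbiased estimate of the total, and record i contributes variance x_i^2 (1/p_i - 1). That contribution vanishes at p_i = 1, which is why a zero-cost solution always exists and is ruled out only by a budget constraint such as sum_i p_i <= k. The sketch below illustrates this baseline with probability-proportional-to-size probabilities under a budget; it is not the paper's learned, regularized ERM solution, and the function names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def budgeted_probabilities(values, budget):
    """Inclusion probabilities proportional to |value|, capped at 1,
    calibrated so the expected sample size equals `budget`.
    (Illustrative PPS scheme, not the paper's learned solution.)"""
    w = np.abs(values).astype(float)
    p = np.minimum(1.0, budget * w / w.sum())
    # Redistribute the budget freed by capped records until stable.
    for _ in range(len(w)):
        capped = p >= 1.0
        free = budget - capped.sum()
        if free <= 0 or capped.all():
            break
        q = np.minimum(1.0, free * w[~capped] / w[~capped].sum())
        if np.allclose(q, p[~capped]):
            break
        p[~capped] = q
    return p

def horvitz_thompson_sum(values, p, mask):
    """Unbiased estimate of sum(values) from a Poisson sample `mask`."""
    return float(np.sum(values[mask] / p[mask]))

values = rng.exponential(scale=10.0, size=10_000)   # skewed record values
p = budgeted_probabilities(values, budget=500)
sample = rng.random(values.size) < p                # Poisson sampling

print("true sum      :", values.sum())
print("estimated sum :", horvitz_thompson_sum(values, p, sample))
print("expected size :", p.sum(), "| actual:", sample.sum())
```

On a skewed value distribution this estimate concentrates near the true sum while touching only about `budget` records; the paper's contribution, by contrast, is to learn such probabilities from example queries via regularized ERM, with a generalization guarantee.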

Cite

Text

Liberty et al. "Stratified Sampling Meets Machine Learning." International Conference on Machine Learning, 2016.

Markdown

[Liberty et al. "Stratified Sampling Meets Machine Learning." International Conference on Machine Learning, 2016.](https://mlanthology.org/icml/2016/liberty2016icml-stratified/)

BibTeX

@inproceedings{liberty2016icml-stratified,
  title     = {{Stratified Sampling Meets Machine Learning}},
  author    = {Liberty, Edo and Lang, Kevin and Shmakov, Konstantin},
  booktitle = {International Conference on Machine Learning},
  year      = {2016},
  pages     = {2320--2329},
  volume    = {48},
  url       = {https://mlanthology.org/icml/2016/liberty2016icml-stratified/}
}