Stratified Sampling Meets Machine Learning
Abstract
This paper solves a specialized regression problem to obtain sampling probabilities for records in a database. The goal is to sample a small set of records over which evaluating aggregate queries can be done both efficiently and accurately. We provide a principled and provable solution for this problem; it is parameterless and requires no data insights. Unlike standard regression problems, the loss is inversely proportional to the regressed-to values. Moreover, a cost-zero solution always exists and can only be excluded by hard budget constraints. A unique form of regularization is also needed. We provide an efficient and simple regularized Empirical Risk Minimization (ERM) algorithm along with a theoretical generalization result. Our extensive experimental results significantly improve over both uniform sampling and standard stratified sampling, which are the de facto industry standards.
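The "loss inversely proportional to the regressed-to values" matches the standard variance formula for inverse-probability-weighted (Horvitz-Thompson) sum estimation from a Poisson sample: Var = Σ xᵢ²(1 − pᵢ)/pᵢ. The sketch below is not the paper's ERM algorithm; it only illustrates why non-uniform sampling probabilities beat uniform sampling on skewed data, using probability-proportional-to-size inclusion probabilities as a simple stand-in. All names (`pps_probabilities`, the lognormal data, the budget `k`) are illustrative assumptions.

```python
import numpy as np

def pps_probabilities(x, k):
    """Inclusion probabilities roughly proportional to |x_i|, capped at 1,
    with expected sample size ~k. A simplified stand-in for the learned
    probabilities in the paper, not the paper's algorithm."""
    p = np.minimum(1.0, k * np.abs(x) / np.sum(np.abs(x)))
    for _ in range(10):  # redistribute budget freed by capped items
        free = k - np.sum(p[p >= 1.0])
        rest = np.abs(x) * (p < 1.0)
        if rest.sum() == 0:
            break
        p = np.where(p >= 1.0, 1.0, np.minimum(1.0, free * rest / rest.sum()))
    return p

def horvitz_thompson_sum(x, p, rng):
    """Unbiased estimate of sum(x) from a Poisson sample: include record i
    with probability p_i, then reweight each sampled value by 1/p_i."""
    sampled = rng.random(len(x)) < p
    return np.sum(x[sampled] / p[sampled])

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=2.0, size=100_000)  # skewed measure column
k = 1_000                                             # expected sample size

p_pps = pps_probabilities(x, k)
p_uni = np.full(len(x), k / len(x))

est_pps = [horvitz_thompson_sum(x, p_pps, rng) for _ in range(50)]
est_uni = [horvitz_thompson_sum(x, p_uni, rng) for _ in range(50)]
print(f"true sum      : {x.sum():.3e}")
print(f"pps  rel. std : {np.std(est_pps) / x.sum():.4f}")
print(f"unif rel. std : {np.std(est_uni) / x.sum():.4f}")
```

On heavy-tailed data like this, the size-proportional probabilities give a markedly smaller relative standard deviation than uniform sampling at the same expected sample size, which is the gap the paper's learned probabilities aim to exploit for arbitrary aggregate queries.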
Cite
Text
Liberty et al. "Stratified Sampling Meets Machine Learning." International Conference on Machine Learning, 2016.
Markdown
[Liberty et al. "Stratified Sampling Meets Machine Learning." International Conference on Machine Learning, 2016.](https://mlanthology.org/icml/2016/liberty2016icml-stratified/)
BibTeX
@inproceedings{liberty2016icml-stratified,
title = {{Stratified Sampling Meets Machine Learning}},
author = {Liberty, Edo and Lang, Kevin and Shmakov, Konstantin},
booktitle = {International Conference on Machine Learning},
year = {2016},
pages = {2320-2329},
volume = {48},
url = {https://mlanthology.org/icml/2016/liberty2016icml-stratified/}
}