Adaptive Stratified Sampling for Precision-Recall Estimation
Abstract
We propose a new algorithm for computing a constant-factor approximation of precision-recall (PR) curves for massive noisy datasets produced by generative models. Assessing validity of items in such datasets requires human annotation, which is costly and must be minimized. Our algorithm, AdaStrat, is the first data-aware method for this task. It chooses the next point to query on the PR curve adaptively, based on previous observations. It then selects specific items to annotate using stratified sampling. Under a mild monotonicity assumption, AdaStrat outputs a guaranteed approximation of the underlying precision function, while using a number of annotations that scales very slowly with N, the dataset size. For example, when the minimum precision is bounded by a constant, it issues only log log N precision queries. In general, it has a regret of no more than log log N w.r.t. an oracle that issues queries at data-dependent (unknown) optimal points. On a scaled-up NLP dataset of 3.5M items, AdaStrat achieves a remarkably close approximation of the true precision function using only 18 precision queries, 13x fewer than best previous approaches.
Cite
Text
Sabharwal and Xue. "Adaptive Stratified Sampling for Precision-Recall Estimation." Conference on Uncertainty in Artificial Intelligence, 2018.Markdown
[Sabharwal and Xue. "Adaptive Stratified Sampling for Precision-Recall Estimation." Conference on Uncertainty in Artificial Intelligence, 2018.](https://mlanthology.org/uai/2018/sabharwal2018uai-adaptive/)BibTeX
@inproceedings{sabharwal2018uai-adaptive,
title = {{Adaptive Stratified Sampling for Precision-Recall Estimation}},
author = {Sabharwal, Ashish and Xue, Yexiang},
booktitle = {Conference on Uncertainty in Artificial Intelligence},
year = {2018},
pages = {825-834},
url = {https://mlanthology.org/uai/2018/sabharwal2018uai-adaptive/}
}