How Much Is Unseen Depends Chiefly on Information About the Seen
Abstract
The *missing mass* is the proportion of data points in an *unknown* population of classifier inputs that belong to classes *not* present in the classifier's training data, where the training data is assumed to be a random sample from that population. We find that, *in expectation*, the missing mass is entirely determined by the numbers $f_k$ of classes that *do* appear exactly $k$ times in the training data, up to an exponentially decaying error. While this is the first precise characterization of the expected missing mass in terms of the sample, the induced estimator suffers from an impractically high variance. However, our theory suggests a large search space of nearly unbiased estimators that can be searched effectively and efficiently. Hence, we cast distribution-free estimation as an optimization problem: given only the sample, find a distribution-specific estimator that minimizes the mean-squared error (MSE). In our experiments, the search algorithm discovers estimators with a substantially smaller MSE than the state-of-the-art Good-Turing estimator; this holds in over 93% of runs when there are at least as many samples as classes, and the discovered estimators' MSE is roughly 80% of the Good-Turing estimator's.
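For context, the Good-Turing baseline referenced above estimates the missing mass as the fraction of singletons in the sample, $\hat{M} = f_1/n$, where $f_1$ is the number of classes observed exactly once and $n$ is the sample size. The minimal sketch below computes this baseline estimate from a raw sample; the function name `good_turing_missing_mass` is ours for illustration, and this is the classical baseline, not the paper's search algorithm.

```python
from collections import Counter

def good_turing_missing_mass(sample):
    """Classical Good-Turing estimate of the missing mass:
    the fraction f_1 / n of sample points whose class was
    observed exactly once."""
    n = len(sample)
    counts = Counter(sample)              # class -> occurrence count
    f1 = sum(1 for c in counts.values() if c == 1)  # number of singletons
    return f1 / n

# Classes 'a' and 'b' each appear once among n = 5 draws,
# so the estimated probability mass of unseen classes is 2/5.
print(good_turing_missing_mass(["a", "b", "c", "c", "c"]))  # 0.4
```

On this toy sample, two of the five draws are singletons, so the estimated probability that the next draw belongs to a class absent from the sample is 0.4.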
Cite
Text
Lee and Boehme. "How Much Is Unseen Depends Chiefly on Information About the Seen." International Conference on Learning Representations, 2025.

Markdown

[Lee and Boehme. "How Much Is Unseen Depends Chiefly on Information About the Seen." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/lee2025iclr-much/)

BibTeX
@inproceedings{lee2025iclr-much,
  title     = {{How Much Is Unseen Depends Chiefly on Information About the Seen}},
  author    = {Lee, Seongmin and Boehme, Marcel},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/lee2025iclr-much/}
}