How Much Is Unseen Depends Chiefly on Information About the Seen

Abstract

The *missing mass* refers to the proportion of data points in an *unknown* population of classifier inputs that belong to classes *not* present in the classifier's training data, which is assumed to be a random sample from that unknown population. We find that *in expectation* the missing mass is entirely determined, up to an *exponentially decaying error*, by the numbers $f_k$ of classes that *do* appear exactly $k$ times in the training data. While this is the first precise characterization of the expected missing mass in terms of the sample, the induced estimator suffers from impractically high variance. However, our theory suggests a large space of nearly unbiased estimators that can be searched effectively and efficiently. Hence, we cast distribution-free estimation as an optimization problem: given only the sample, find a distribution-specific estimator that minimizes the mean-squared error (MSE). In our experiments, our search algorithm discovers estimators with a substantially smaller MSE than the state-of-the-art Good-Turing estimator. This holds for over 93% of runs when there are at least as many samples as classes. Our estimators' MSE is roughly 80% of the Good-Turing estimator's.
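
Since the abstract benchmarks against the Good-Turing estimator, the minimal sketch below illustrates the two quantities the discussion turns on: the frequency-of-frequencies profile $f_k$ and the classical Good-Turing missing-mass estimate $f_1/n$ (the fraction of the sample made up of singleton classes). The Zipf-like population, class count, and sample size are illustrative assumptions, not the paper's experimental setup, and the Monte Carlo loop measures only the baseline's MSE, not the searched estimators from the paper.

```python
import random
from collections import Counter

def freq_of_freqs(sample):
    """f_k: number of distinct classes observed exactly k times in the sample."""
    class_counts = Counter(sample)
    return Counter(class_counts.values())

def good_turing_missing_mass(sample):
    """Classical Good-Turing estimate of the missing mass: f_1 / n."""
    f = freq_of_freqs(sample)
    return f[1] / len(sample)

def true_missing_mass(sample, probs):
    """Ground truth: total probability of classes absent from the sample."""
    seen = set(sample)
    return sum(p for c, p in enumerate(probs) if c not in seen)

if __name__ == "__main__":
    random.seed(0)
    # Hypothetical population: Zipf-like distribution over 1000 classes
    # (an assumption for illustration, not the paper's benchmark).
    weights = [1.0 / (c + 1) for c in range(1000)]
    total = sum(weights)
    probs = [w / total for w in weights]
    classes = list(range(1000))

    n = 500  # sample size on the order of the number of classes
    sq_errors = []
    for _ in range(200):
        sample = random.choices(classes, weights=probs, k=n)
        err = good_turing_missing_mass(sample) - true_missing_mass(sample, probs)
        sq_errors.append(err ** 2)
    print(f"Good-Turing MSE over 200 runs: {sum(sq_errors) / len(sq_errors):.6f}")
```

Running this prints a single MSE figure for the Good-Turing baseline under the assumed population; a distribution-specific estimator of the kind the paper searches for would aim to undercut that number given only the sample.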

Cite

Text

Lee and Boehme. "How Much Is Unseen Depends Chiefly on Information About the Seen." International Conference on Learning Representations, 2025.

Markdown

[Lee and Boehme. "How Much Is Unseen Depends Chiefly on Information About the Seen." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/lee2025iclr-much/)

BibTeX

@inproceedings{lee2025iclr-much,
  title     = {{How Much Is Unseen Depends Chiefly on Information About the Seen}},
  author    = {Lee, Seongmin and Boehme, Marcel},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/lee2025iclr-much/}
}