On the Convergence Rate of Good-Turing Estimators

Abstract

Good-Turing adjustments of word frequencies are an important tool in natural language modeling. In particular, for any sample of words, there is a set of words not occurring in that sample. The total probability mass of the words not in the sample is the so-called missing mass. Good showed that the fraction of the sample consisting of words that occur only once in the sample is a nearly unbiased estimate of the missing mass. Here, we give a PAC-style high-probability confidence interval for the actual missing mass. More generally, for k ≥ 0, we give a confidence interval for the true probability mass of the set of words occurring k times in the sample.

1 INTRODUCTION

Since the publication of the Good-Turing estimators in 1953 [4], these estimators have been used extensively in language modeling applications [2, 3, 6]. In spite of the extensive use of Good-Turing estimators, little theoretical work has been done on these estimators since the original theorems showing that they have negligible...
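The estimator discussed in the abstract is simple to state concretely. A minimal sketch, assuming the standard Good-Turing form (k + 1) · N_{k+1} / n, where N_j is the number of distinct words occurring exactly j times and n is the sample size; the function name is illustrative, not from the paper:

```python
from collections import Counter

def good_turing_mass_estimate(sample, k=0):
    """Good-Turing estimate of the total probability mass of the set of
    words occurring exactly k times in the sample.

    For k = 0 this reduces to the classic missing-mass estimate N_1 / n:
    the fraction of the sample made up of words seen exactly once.
    """
    n = len(sample)
    counts = Counter(sample)
    # N_{k+1}: number of distinct words with count exactly k + 1.
    n_k_plus_1 = sum(1 for c in counts.values() if c == k + 1)
    return (k + 1) * n_k_plus_1 / n

# Example: in a sample of 7 words where "c" and "d" each occur once,
# the estimated missing mass is N_1 / n = 2/7.
sample = ["a", "a", "a", "b", "b", "c", "d"]
print(good_turing_mass_estimate(sample, k=0))  # 2/7
```

The point estimate alone says nothing about its deviation from the true missing mass; the paper's contribution is a high-probability confidence interval around it.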

Cite

Text

McAllester and Schapire. "On the Convergence Rate of Good-Turing Estimators." Annual Conference on Computational Learning Theory, 2000.

Markdown

[McAllester and Schapire. "On the Convergence Rate of Good-Turing Estimators." Annual Conference on Computational Learning Theory, 2000.](https://mlanthology.org/colt/2000/mcallester2000colt-convergence/)

BibTeX

@inproceedings{mcallester2000colt-convergence,
  title     = {{On the Convergence Rate of Good-Turing Estimators}},
  author    = {McAllester, David A. and Schapire, Robert E.},
  booktitle = {Annual Conference on Computational Learning Theory},
  year      = {2000},
  pages     = {1--6},
  url       = {https://mlanthology.org/colt/2000/mcallester2000colt-convergence/}
}