Concentration Bounds for Unigram Language Models

Abstract

We show several high-probability concentration bounds for learning unigram language models. One interesting quantity is the probability of all words appearing exactly $k$ times in a sample of size $m$. A standard estimator for this quantity is the Good-Turing estimator. The existing analysis of its error shows a high-probability bound of approximately $O(k / \sqrt{m})$. We improve its dependency on $k$ to $O(k^{1/4} / \sqrt{m} + k / m)$. We also analyze the empirical frequencies estimator, showing that with high probability its error is bounded by approximately $O(1/k + \sqrt{k}/m)$. We derive a combined estimator, which has an error of approximately $O(m^{-2/5})$ for any $k$.
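As context for the quantities discussed in the abstract, below is a minimal Python sketch of the two estimators being compared. The function names and the toy corpus are illustrative, not from the paper. It relies on the standard definitions: writing $n_k$ for the number of distinct words appearing exactly $k$ times in the sample, the Good-Turing estimate for the total probability of words appearing exactly $k$ times is $(k+1)\, n_{k+1} / m$, while the empirical frequencies estimate of the same quantity is $k\, n_k / m$.

```python
from collections import Counter

def count_of_counts(sample):
    """n_k: number of distinct words appearing exactly k times."""
    freqs = Counter(sample)          # word -> frequency
    return Counter(freqs.values())   # frequency -> number of such words

def good_turing(sample, k):
    """Good-Turing estimate of the total probability of words
    appearing exactly k times: (k + 1) * n_{k+1} / m."""
    m = len(sample)
    n = count_of_counts(sample)
    return (k + 1) * n[k + 1] / m

def empirical_frequencies(sample, k):
    """Empirical-frequencies estimate of the same quantity: k * n_k / m."""
    m = len(sample)
    n = count_of_counts(sample)
    return k * n[k] / m

# Toy sample (hypothetical data, for illustration only):
# a appears 3 times, b and c twice, d once, e four times; m = 12.
sample = "a a a b b c c d e e e e".split()
print(good_turing(sample, 2))            # (2+1) * n_3 / 12 = 3/12 = 0.25
print(empirical_frequencies(sample, 2))  # 2 * n_2 / 12 = 4/12 ~ 0.333
```

The paper's bounds quantify how far each of these estimates can deviate from the true probability mass of the $k$-count words; the combined estimator mentioned in the abstract balances their errors across small and large $k$.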

Cite

Text

Drukh and Mansour. "Concentration Bounds for Unigram Language Models." Journal of Machine Learning Research, 2005.

Markdown

[Drukh and Mansour. "Concentration Bounds for Unigram Language Models." Journal of Machine Learning Research, 2005.](https://mlanthology.org/jmlr/2005/drukh2005jmlr-concentration/)

BibTeX

@article{drukh2005jmlr-concentration,
  title     = {{Concentration Bounds for Unigram Language Models}},
  author    = {Drukh, Evgeny and Mansour, Yishay},
  journal   = {Journal of Machine Learning Research},
  year      = {2005},
  pages     = {1231--1264},
  volume    = {6},
  url       = {https://mlanthology.org/jmlr/2005/drukh2005jmlr-concentration/}
}