Concentration Bounds for Unigrams Language Model

Abstract

We show several PAC-style concentration bounds for learning a unigrams language model. One interesting quantity is the total probability of all words appearing exactly k times in a sample of size m . A standard estimator for this quantity is the Good-Turing estimator. The existing analysis of its error shows a PAC bound of approximately $O(\frac{k}{\sqrt{m}})$ . We improve its dependency on k to $O(\frac{\sqrt[4]{k}}{\sqrt{m}}+\frac{k}{m})$ . We also analyze the empirical frequencies estimator, showing that its PAC error bound is approximately $O(\frac{1}{k}+\frac{\sqrt{k}}{m})$ . We derive a combined estimator, which has an error of approximately $O(m^{-2/5})$ , for any k . A standard measure for the quality of a learning algorithm is its expected per-word log-loss. We show that the leave-one-out method can be used for estimating the log-loss of the unigrams model with a PAC error of approximately $O(\frac{1}{\sqrt{m}})$ , for any distribution. We also bound the log-loss a priori, as a function of various parameters of the distribution.
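To make the two estimators concrete: for the mass $M_k$ of all words appearing exactly k times, the Good-Turing estimator uses $\frac{(k+1)\,n_{k+1}}{m}$, where $n_r$ is the number of distinct words seen exactly r times, while the empirical frequencies estimator uses $\frac{k\,n_k}{m}$. A minimal sketch in Python (function names are ours; this illustrates only the estimators, not the paper's concentration analysis):

```python
from collections import Counter

def good_turing_mass(sample, k):
    """Good-Turing estimate of M_k, the total probability of all words
    appearing exactly k times: (k+1) * n_{k+1} / m, where n_{k+1} is the
    number of distinct words seen exactly k+1 times and m = len(sample)."""
    m = len(sample)
    counts = Counter(sample)
    n = Counter(counts.values())  # n[r] = number of words seen exactly r times
    return (k + 1) * n.get(k + 1, 0) / m

def empirical_mass(sample, k):
    """Empirical-frequencies estimate of M_k: k * n_k / m."""
    m = len(sample)
    counts = Counter(sample)
    n_k = sum(1 for c in counts.values() if c == k)
    return k * n_k / m

# Counts here: a:2, b:3, c:1, d:2, e:1, so m = 9, n_1 = 2, n_2 = 2.
sample = "a a b b b c d d e".split()
print(good_turing_mass(sample, 1))  # 2 * n_2 / 9 = 4/9
print(empirical_mass(sample, 1))    # 1 * n_1 / 9 = 2/9
```

Note that with k = 0 the Good-Turing estimate $\frac{n_1}{m}$ recovers the classical missing-mass estimate, the total probability of unseen words.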

Cite

Text

Drukh and Mansour. "Concentration Bounds for Unigrams Language Model." Annual Conference on Computational Learning Theory, 2004. doi:10.1007/978-3-540-27819-1_12

Markdown

[Drukh and Mansour. "Concentration Bounds for Unigrams Language Model." Annual Conference on Computational Learning Theory, 2004.](https://mlanthology.org/colt/2004/drukh2004colt-concentration/) doi:10.1007/978-3-540-27819-1_12

BibTeX

@inproceedings{drukh2004colt-concentration,
  title     = {{Concentration Bounds for Unigrams Language Model}},
  author    = {Drukh, Evgeny and Mansour, Yishay},
  booktitle = {Annual Conference on Computational Learning Theory},
  year      = {2004},
  pages     = {170-185},
  doi       = {10.1007/978-3-540-27819-1_12},
  url       = {https://mlanthology.org/colt/2004/drukh2004colt-concentration/}
}