Modeling Word Burstiness Using the Dirichlet Distribution

Madsen, Rasmus Elsborg; Kauchak, David; Elkan, Charles

doi:10.1145/1102351.1102420

ICML 2005 pp. 545-552

Abstract

Multinomial distributions are often used to model text documents. However, they do not capture well the phenomenon that words in a document tend to appear in bursts: if a word appears once, it is more likely to appear again. In this paper, we propose the Dirichlet compound multinomial model (DCM) as an alternative to the multinomial. The DCM model has one additional degree of freedom, which allows it to capture burstiness. We show experimentally that the DCM is substantially better than the multinomial at modeling text data, measured by perplexity. We also show using three standard document collections that the DCM leads to better classification than the multinomial model. DCM performance is comparable to that obtained with multiple heuristic changes to the multinomial model.
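The contrast the abstract draws can be illustrated with a small sketch. It assumes the standard textbook form of the Dirichlet compound multinomial (also called the Pólya distribution), which is not written out on this page: p(x | α) = n!/∏ x_w! · Γ(A)/Γ(A + n) · ∏ Γ(x_w + α_w)/Γ(α_w), with A = Σ α_w and n = Σ x_w. The function names, the four-word vocabulary, and the parameter values are hypothetical choices made for illustration and are not taken from the paper.

```python
from math import lgamma, log, exp


def log_multinomial_coef(counts):
    """log( n! / prod_w x_w! ) for a vector of word counts."""
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)


def multinomial_log_prob(counts, theta):
    """Log-probability of a count vector under a multinomial with word probabilities theta."""
    return log_multinomial_coef(counts) + sum(
        x * log(t) for x, t in zip(counts, theta) if x > 0
    )


def dcm_log_prob(counts, alpha):
    """Log-probability of a count vector under a DCM (Polya) with parameters alpha.

    Assumes the standard DCM form stated in the lead-in above.
    """
    n = sum(counts)
    a = sum(alpha)
    return (
        log_multinomial_coef(counts)
        + lgamma(a) - lgamma(a + n)
        + sum(lgamma(x + al) - lgamma(al) for x, al in zip(counts, alpha))
    )


# Hypothetical toy setup: 4-word vocabulary, documents of length 4.
alpha = [0.1, 0.1, 0.1, 0.1]                 # small alphas make the DCM bursty
theta = [a / sum(alpha) for a in alpha]      # multinomial with the same mean (uniform)

bursty = [4, 0, 0, 0]   # one word repeated four times
spread = [1, 1, 1, 1]   # four different words, once each

print("multinomial  bursty:", exp(multinomial_log_prob(bursty, theta)))
print("multinomial  spread:", exp(multinomial_log_prob(spread, theta)))
print("DCM          bursty:", exp(dcm_log_prob(bursty, alpha)))
print("DCM          spread:", exp(dcm_log_prob(spread, alpha)))
```

In this toy setting the multinomial prefers the evenly spread document, while the DCM with small α values gives more mass to the document in which one word repeats. Both models have the same expected word frequencies here; the overall scale of α is the extra degree of freedom that controls how strongly a first occurrence raises the chance of further occurrences.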

Cite

Text

Madsen et al. "Modeling Word Burstiness Using the Dirichlet Distribution." International Conference on Machine Learning, 2005. doi:10.1145/1102351.1102420

Markdown

[Madsen et al. "Modeling Word Burstiness Using the Dirichlet Distribution." International Conference on Machine Learning, 2005.](https://mlanthology.org/icml/2005/madsen2005icml-modeling/) doi:10.1145/1102351.1102420

BibTeX

@inproceedings{madsen2005icml-modeling,
  title     = {{Modeling Word Burstiness Using the Dirichlet Distribution}},
  author    = {Madsen, Rasmus Elsborg and Kauchak, David and Elkan, Charles},
  booktitle = {International Conference on Machine Learning},
  year      = {2005},
  pages     = {545--552},
  doi       = {10.1145/1102351.1102420},
  url       = {https://mlanthology.org/icml/2005/madsen2005icml-modeling/}
}