A Probabilistic Model for Online Document Clustering with Application to Novelty Detection

Abstract

In this paper we propose a probabilistic model for online document clustering. We use a non-parametric Dirichlet process prior to model the growing number of clusters, and use a prior of a general English language model as the base distribution to handle the generation of novel clusters. Furthermore, cluster uncertainty is modeled with a Bayesian Dirichlet-multinomial distribution. We use an empirical Bayes method to estimate hyperparameters based on a historical dataset. Our probabilistic model is applied to the novelty detection task in Topic Detection and Tracking (TDT) and compared with existing approaches in the literature.
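
To make the ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It combines the standard Dirichlet process (Chinese restaurant process) prior, in which an existing cluster is weighted by its document count and a new cluster by a concentration parameter, with a Dirichlet-multinomial likelihood whose pseudo-counts stand in for the general English language model. The function names, the toy vocabulary, the pseudo-count values, and the use of the posterior probability of a new cluster as a novelty score are all illustrative assumptions.

```python
import math
from collections import Counter


def log_dirichlet_multinomial(doc_counts, pseudo_counts):
    """Log-probability of a document's word sequence under a
    Dirichlet-multinomial with the given pseudo-counts.
    Assumes every word in the document has an entry in pseudo_counts."""
    total_pseudo = sum(pseudo_counts.values())
    total_words = sum(doc_counts.values())
    logp = math.lgamma(total_pseudo) - math.lgamma(total_pseudo + total_words)
    for word, count in doc_counts.items():
        a = pseudo_counts[word]
        logp += math.lgamma(a + count) - math.lgamma(a)
    return logp


def new_cluster_probability(doc_counts, clusters, base_pseudo, alpha):
    """Posterior probability that the document starts a new cluster.

    clusters    : list of (num_docs, Counter of word counts) pairs
    base_pseudo : pseudo-counts standing in for a general language model
                  (hypothetical values, not estimated from data here)
    alpha       : Dirichlet process concentration parameter
    """
    # New cluster: prior weight alpha, likelihood under the base distribution.
    log_weights = [math.log(alpha)
                   + log_dirichlet_multinomial(doc_counts, base_pseudo)]
    # Existing cluster: prior weight n_k, likelihood under the base
    # pseudo-counts updated with the words already seen in that cluster.
    for num_docs, word_counts in clusters:
        pseudo = dict(base_pseudo)
        for word, count in word_counts.items():
            pseudo[word] = pseudo.get(word, 0.0) + count
        log_weights.append(math.log(num_docs)
                           + log_dirichlet_multinomial(doc_counts, pseudo))
    # Normalise in log space for numerical stability.
    top = max(log_weights)
    weights = [math.exp(lw - top) for lw in log_weights]
    return weights[0] / sum(weights)


if __name__ == "__main__":
    base = {"goal": 1.0, "match": 1.0, "vote": 1.0, "election": 1.0}
    clusters = [(3, Counter({"goal": 6, "match": 4}))]
    sports_doc = Counter({"goal": 2, "match": 1})
    politics_doc = Counter({"vote": 2, "election": 1})
    print(new_cluster_probability(sports_doc, clusters, base, alpha=1.0))
    print(new_cluster_probability(politics_doc, clusters, base, alpha=1.0))
```

In this toy setting a document that shares vocabulary with the existing cluster receives a low new-cluster probability, while a document on an unseen topic receives a high one. In the sketch the concentration parameter and pseudo-counts are fixed by hand; the abstract instead describes estimating hyperparameters by empirical Bayes from a historical dataset.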

Cite

Text

Zhang et al. "A Probabilistic Model for Online Document Clustering with Application to Novelty Detection." Neural Information Processing Systems, 2004.

Markdown

[Zhang et al. "A Probabilistic Model for Online Document Clustering with Application to Novelty Detection." Neural Information Processing Systems, 2004.](https://mlanthology.org/neurips/2004/zhang2004neurips-probabilistic/)

BibTeX

@inproceedings{zhang2004neurips-probabilistic,
  title     = {{A Probabilistic Model for Online Document Clustering with Application to Novelty Detection}},
  author    = {Zhang, Jian and Ghahramani, Zoubin and Yang, Yiming},
  booktitle = {Neural Information Processing Systems},
  year      = {2004},
  pages     = {1617-1624},
  url       = {https://mlanthology.org/neurips/2004/zhang2004neurips-probabilistic/}
}