Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model

Abstract

Techniques such as probabilistic topic models and latent-semantic indexing have been shown to be broadly useful at automatically extracting the topical or semantic content of documents, or more generally for dimension-reduction of sparse count data. These types of models and algorithms can be viewed as generating an abstraction from the words in a document to a lower-dimensional latent variable representation that captures what the document is generally about beyond the specific words it contains. In this paper we propose a new probabilistic model that tempers this approach by representing each document as a combination of (a) a background distribution over common words, (b) a mixture distribution over general topics, and (c) a distribution over words that are treated as being specific to that document. We illustrate how this model can be used for information retrieval by matching documents both at a general topic level and at a specific word level, providing an advantage over techniques that only match documents at a general level (such as topic models or latent-semantic indexing) or that only match documents at the specific word level (such as TF-IDF).
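The three-part representation described in the abstract can be illustrated with a minimal sketch. It assumes the word probability for a document is a weighted sum of the background, topic-mixture, and document-specific distributions; the function name, the weights, and the toy distributions are all hypothetical values chosen for illustration and are not taken from the paper.

```python
# Illustrative sketch only: names, weights, and toy data are assumptions.

def word_probability(word, weights, background, topics, topic_mix, specific):
    """Probability of `word` in one document under a three-way mixture.

    weights    -- (background, topics, specific) mixing weights, summing to 1
    background -- dict: word -> probability, shared by all documents
    topics     -- list of dicts: one word distribution per general topic
    topic_mix  -- list of per-document topic proportions, summing to 1
    specific   -- dict: word -> probability, particular to this document
    """
    w_background, w_topics, w_specific = weights
    # (b) mixture over general topics for this document
    p_topics = sum(
        proportion * topic.get(word, 0.0)
        for proportion, topic in zip(topic_mix, topics)
    )
    return (
        w_background * background.get(word, 0.0)   # (a) common words
        + w_topics * p_topics                      # (b) general topics
        + w_specific * specific.get(word, 0.0)     # (c) document-specific words
    )


# Toy example with hypothetical distributions.
background = {"the": 0.5, "of": 0.5}
topics = [
    {"model": 0.5, "topic": 0.5},
    {"query": 0.5, "retrieval": 0.5},
]
topic_mix = [0.75, 0.25]
specific = {"kestrel": 1.0}
weights = (0.25, 0.5, 0.25)

vocabulary = ["the", "of", "model", "topic", "query", "retrieval", "kestrel"]
probabilities = {
    w: word_probability(w, weights, background, topics, topic_mix, specific)
    for w in vocabulary
}
total = sum(probabilities.values())
```

In this toy setup a rare word such as "kestrel" gets its probability from the document-specific component alone, which is the kind of word-level evidence a purely topic-based representation would smooth away.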

Cite

Text

Chemudugunta et al. "Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model." Neural Information Processing Systems, 2006.

Markdown

[Chemudugunta et al. "Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model." Neural Information Processing Systems, 2006.](https://mlanthology.org/neurips/2006/chemudugunta2006neurips-modeling/)

BibTeX

@inproceedings{chemudugunta2006neurips-modeling,
  title     = {{Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model}},
  author    = {Chemudugunta, Chaitanya and Smyth, Padhraic and Steyvers, Mark},
  booktitle = {Neural Information Processing Systems},
  year      = {2006},
  pages     = {241-248},
  url       = {https://mlanthology.org/neurips/2006/chemudugunta2006neurips-modeling/}
}