Topic Modeling: Beyond Bag-of-Words
Abstract
Some models of textual corpora employ text generation methods involving n-gram statistics, while others use latent topic variables inferred using the "bag-of-words" assumption, in which word order is ignored. Previously, these methods have not been combined. In this work, I explore a hierarchical generative probabilistic model that incorporates both n-gram statistics and latent topic variables by extending a unigram topic model to include properties of a hierarchical Dirichlet bigram language model. The model hyperparameters are inferred using a Gibbs EM algorithm. On two data sets, each of 150 documents, the new model exhibits better predictive accuracy than either a hierarchical Dirichlet bigram language model or a unigram topic model. Additionally, the inferred topics are less dominated by function words than are topics discovered using unigram statistics, potentially making them more meaningful.
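As a rough illustration of the generative process the abstract describes, the following Python sketch samples each word conditioned on both a latent topic and the preceding word, combining a unigram topic model's per-document topic mixture with bigram-style word statistics. All names, dimensions, and hyperparameter values here (V, T, D, alpha, beta, theta, phi, generate_document) are illustrative assumptions, not the paper's implementation; in particular, the paper places hierarchical Dirichlet priors over these distributions and infers the hyperparameters with Gibbs EM, both of which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small-scale dimensions, chosen only for illustration.
V, T, D = 200, 10, 5         # vocabulary size, number of topics, documents
alpha, beta = 0.1, 0.01      # assumed symmetric Dirichlet hyperparameters

# theta[d]: per-document distribution over topics, as in a unigram topic model.
theta = rng.dirichlet(alpha * np.ones(T), size=D)

# phi[t, w_prev]: distribution over next words given (topic, previous word).
# Conditioning on the previous word is the bigram extension; a plain unigram
# topic model would condition on the topic alone.
phi = rng.dirichlet(beta * np.ones(V), size=(T, V))

def generate_document(d, length=50):
    """Sample one document from this assumed generative process."""
    words, w_prev = [], 0  # word index 0 stands in for a sentence-start token
    for _ in range(length):
        z = rng.choice(T, p=theta[d])        # draw topic from document mixture
        w = rng.choice(V, p=phi[z, w_prev])  # draw word given topic and context
        words.append(w)
        w_prev = w
    return words

print(generate_document(0)[:10])
```

Note that the paper's inference runs in the opposite direction: given observed text, topic assignments are resampled with Gibbs sampling while the hyperparameters are updated in an EM-style outer loop.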
Cite
Text
Wallach. "Topic Modeling: Beyond Bag-of-Words." International Conference on Machine Learning, 2006. doi:10.1145/1143844.1143967Markdown
[Wallach. "Topic Modeling: Beyond Bag-of-Words." International Conference on Machine Learning, 2006.](https://mlanthology.org/icml/2006/wallach2006icml-topic/) doi:10.1145/1143844.1143967BibTeX
@inproceedings{wallach2006icml-topic,
title = {{Topic Modeling: Beyond Bag-of-Words}},
author = {Wallach, Hanna M.},
booktitle = {International Conference on Machine Learning},
year = {2006},
pages = {977--984},
doi = {10.1145/1143844.1143967},
url = {https://mlanthology.org/icml/2006/wallach2006icml-topic/}
}