A Statistical Approach for Optimal Topic Model Identification
Abstract
Latent Dirichlet Allocation (LDA) is a popular machine-learning technique that identifies latent structures in a corpus of documents. This paper addresses the ongoing concern that formal procedures for determining the optimal LDA configuration do not exist by introducing a set of parametric tests that rely on the assumed multinomial distribution specification underlying the original LDA model. Our methodology defines a set of rigorous statistical procedures that identify and evaluate the optimal topic model. The U.S. Presidential Inaugural Address Corpus is used as a case study to show the numerical results. We find that 92 topics best describe the corpus. We further validate the method through a simulation study confirming the superiority of our approach compared to other standard heuristic metrics such as the perplexity index.
Cite
Text
Lewis and Grossetti. "A Statistical Approach for Optimal Topic Model Identification." Journal of Machine Learning Research, 2022.
Markdown
[Lewis and Grossetti. "A Statistical Approach for Optimal Topic Model Identification." Journal of Machine Learning Research, 2022.](https://mlanthology.org/jmlr/2022/lewis2022jmlr-statistical/)
BibTeX
@article{lewis2022jmlr-statistical,
title = {{A Statistical Approach for Optimal Topic Model Identification}},
author = {Lewis, Craig M. and Grossetti, Francesco},
journal = {Journal of Machine Learning Research},
year = {2022},
pages = {1--20},
volume = {23},
url = {https://mlanthology.org/jmlr/2022/lewis2022jmlr-statistical/}
}