Optimal Estimation of Sparse Topic Models

Abstract

Topic models have become popular tools for dimension reduction and exploratory analysis of text data, which consists of observed frequencies of a vocabulary of $p$ words in $n$ documents, stored in a $p\times n$ matrix. The main premise is that the mean of this data matrix can be factorized into a product of two non-negative matrices: a $p\times K$ word-topic matrix $A$ and a $K\times n$ topic-document matrix $W$. This paper studies the estimation of $A$ when it is possibly element-wise sparse and the number of topics $K$ is unknown. In this under-explored context, we derive a new minimax lower bound for the estimation of such $A$ and propose a new computationally efficient algorithm for its recovery. We derive a finite sample upper bound for our estimator and show that it matches the minimax lower bound in many scenarios. Our estimator adapts to the unknown sparsity of $A$, and our analysis is valid for any finite $n$, $p$, $K$ and document lengths. Empirical results on both synthetic and semi-synthetic data show that our proposed estimator is a strong competitor of the existing state-of-the-art algorithms for both non-sparse $A$ and sparse $A$, and has superior performance in many scenarios of interest.
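
To make the data-generating model in the abstract concrete, here is a minimal sketch of the topic-model factorization $\mathbb{E}[X] \approx AW$: a sparse $p\times K$ word-topic matrix $A$ with columns on the probability simplex, a $K\times n$ topic-document matrix $W$, and multinomial word counts per document. All sizes, the sparsity level, and the sampling choices below are illustrative assumptions, not parameters from the paper, and this is not the authors' estimation algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
p, K, n, doc_len = 500, 5, 100, 200  # vocabulary size, topics, documents, words per document (assumed)

# Sparse word-topic matrix A: zero out most entries, then renormalize each
# column so it is a probability distribution over the p words.
A = rng.exponential(size=(p, K))
A *= rng.random((p, K)) < 0.1          # element-wise sparsity (assumed 10% support)
A += 1e-12                             # guard against all-zero columns
A /= A.sum(axis=0, keepdims=True)

# Topic-document matrix W: each column is a distribution over the K topics.
W = rng.dirichlet(np.ones(K), size=n).T  # shape (K, n)

# Each document's word probabilities are the corresponding column of A @ W;
# observed frequencies are multinomial counts divided by the document length.
M = A @ W                                # (p, n), each column sums to 1
X = np.column_stack([
    rng.multinomial(doc_len, M[:, j]) / doc_len for j in range(n)
])                                       # (p, n) matrix of observed word frequencies
```

Under this model, the observed frequency matrix `X` has mean `A @ W`, which is exactly the factorization whose sparse factor $A$ the paper aims to recover.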

Cite

Text

Bing et al. "Optimal Estimation of Sparse Topic Models." Journal of Machine Learning Research, 2020.

Markdown

[Bing et al. "Optimal Estimation of Sparse Topic Models." Journal of Machine Learning Research, 2020.](https://mlanthology.org/jmlr/2020/bing2020jmlr-optimal/)

BibTeX

@article{bing2020jmlr-optimal,
  title     = {{Optimal Estimation of Sparse Topic Models}},
  author    = {Bing, Xin and Bunea, Florentina and Wegkamp, Marten},
  journal   = {Journal of Machine Learning Research},
  year      = {2020},
  pages     = {1--45},
  volume    = {21},
  url       = {https://mlanthology.org/jmlr/2020/bing2020jmlr-optimal/}
}