Fast Algorithms for Learning with Long N-Grams via Suffix Tree Based Matrix Multiplication

Abstract

This paper addresses the computational and statistical issues of learning with long -- and possibly all -- $N$-grams in a document corpus. We leverage the rich algebraic structure of $N$-gram matrices to provide a data structure which can store and multiply any $N$-gram matrix in memory and time that is, at worst, linear in the length of the corpus from which it is derived. As matrix-vector multiplication lies at the heart of most machine learning algorithms, our algorithm can speed up any learning procedure that uses $N$-gram features and has such structure. We also provide an efficient, linear running time and memory, framework that produces our data structure and screens $N$-gram features according to a multitude of statistical criteria. We demonstrate the performance of our algorithm on natural language and DNA sequence datasets; the computational and memory savings are substantial. Finally, we apply our framework to several large-scale sentiment analysis problems involving millions of reviews over gigabytes of text and show that higher-order $N$-grams can substantially improve performance.

Cite

Text

Paskov et al. "Fast Algorithms for Learning with Long N-Grams via Suffix Tree Based Matrix Multiplication." Conference on Uncertainty in Artificial Intelligence, 2015.

Markdown

[Paskov et al. "Fast Algorithms for Learning with Long N-Grams via Suffix Tree Based Matrix Multiplication." Conference on Uncertainty in Artificial Intelligence, 2015.](https://mlanthology.org/uai/2015/paskov2015uai-fast/)

BibTeX

@inproceedings{paskov2015uai-fast,
  title     = {{Fast Algorithms for Learning with Long N-Grams via Suffix Tree Based Matrix Multiplication}},
  author    = {Paskov, Hristo S. and Mitchell, John C. and Hastie, Trevor J.},
  booktitle = {Conference on Uncertainty in Artificial Intelligence},
  year      = {2015},
  pages     = {672-681},
  url       = {https://mlanthology.org/uai/2015/paskov2015uai-fast/}
}