Fast Algorithms for Learning with Long N-Grams via Suffix Tree Based Matrix Multiplication
Abstract
This paper addresses the computational and statistical issues of learning with long -- and possibly all -- $N$-grams in a document corpus. We leverage the rich algebraic structure of $N$-gram matrices to provide a data structure that can store and multiply any $N$-gram matrix in memory and time that is, at worst, linear in the length of the corpus from which it is derived. As matrix-vector multiplication lies at the heart of most machine learning algorithms, our data structure can speed up any learning procedure that uses $N$-gram features and relies on such multiplications. We also provide an efficient framework, with linear running time and memory, that produces our data structure and screens $N$-gram features according to a multitude of statistical criteria. We demonstrate the performance of our algorithm on natural language and DNA sequence datasets; the computational and memory savings are substantial. Finally, we apply our framework to several large-scale sentiment analysis problems involving millions of reviews over gigabytes of text and show that higher-order $N$-grams can substantially improve performance.
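For context, the baseline the paper improves on can be sketched as a naive, explicitly materialized all-$N$-grams count matrix together with a plain matrix-vector product. This sketch is illustrative only (the function names are hypothetical and the suffix-tree data structure itself is not reproduced here):

```python
from collections import defaultdict

def ngram_matrix(docs, n_max):
    """Naive dense N-gram count matrix.

    Rows correspond to documents; columns correspond to every
    N-gram of length 1..n_max observed anywhere in the corpus.
    """
    vocab = {}          # N-gram -> column index
    per_doc = []        # per-document N-gram counts
    for doc in docs:
        counts = defaultdict(int)
        toks = doc.split()
        for n in range(1, n_max + 1):
            for i in range(len(toks) - n + 1):
                gram = " ".join(toks[i:i + n])
                if gram not in vocab:
                    vocab[gram] = len(vocab)
                counts[gram] += 1
        per_doc.append(counts)
    # Materialize the dense matrix (this is the expensive step).
    X = [[0] * len(vocab) for _ in per_doc]
    for r, counts in enumerate(per_doc):
        for gram, c in counts.items():
            X[r][vocab[gram]] = c
    return X, vocab

def matvec(X, w):
    """Plain dense matrix-vector product X @ w."""
    return [sum(x * wi for x, wi in zip(row, w)) for row in X]

X, vocab = ngram_matrix(["a b a", "b a"], n_max=2)
# X[0][vocab["a"]] is 2: the unigram "a" occurs twice in the first document.
scores = matvec(X, [1] * len(vocab))
```

Taking all $N$-grams of a corpus of length $L$, this explicit representation can have up to $O(L^2)$ columns; the paper's suffix-tree based structure is precisely what avoids materializing it while still supporting the matrix-vector products above in linear time and memory.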
Cite
Paskov et al. "Fast Algorithms for Learning with Long N-Grams via Suffix Tree Based Matrix Multiplication." Conference on Uncertainty in Artificial Intelligence, 2015.
BibTeX
@inproceedings{paskov2015uai-fast,
title = {{Fast Algorithms for Learning with Long N-Grams via Suffix Tree Based Matrix Multiplication}},
author = {Paskov, Hristo S. and Mitchell, John C. and Hastie, Trevor J.},
booktitle = {Conference on Uncertainty in Artificial Intelligence},
year = {2015},
pages = {672--681},
url = {https://mlanthology.org/uai/2015/paskov2015uai-fast/}
}