A Spectral Algorithm for Learning Class-Based N-Gram Models of Natural Language
Abstract
The Brown clustering algorithm (Brown et al., 1992) is widely used in natural language process-ing (NLP) to derive lexical representations that are then used to improve performance on vari-ous NLP problems. The algorithm assumes an underlying model that is essentially an HMM, with the restriction that each word in the vocab-ulary is emitted from a single state. A greedy, bottom-up method is then used to find the clus-tering; this method does not have a guarantee of finding the correct underlying clustering. In this paper we describe a new algorithm for clustering under the Brown et al. model. The method relies on two steps: first, the use of canonical correla-tion analysis to derive a low-dimensional repre-sentation of words; second, a bottom-up hierar-chical clustering over these representations. We show that given a sufficient number of training examples sampled from the Brown et al. model, the method is guaranteed to recover the correct clustering. Experiments show that the method recovers clusters of comparable quality to the al-gorithm of Brown et al. (1992), but is an order of magnitude more efficient. 1
Cite
Text
Stratos et al. "A Spectral Algorithm for Learning Class-Based N-Gram Models of Natural Language." Conference on Uncertainty in Artificial Intelligence, 2014.Markdown
[Stratos et al. "A Spectral Algorithm for Learning Class-Based N-Gram Models of Natural Language." Conference on Uncertainty in Artificial Intelligence, 2014.](https://mlanthology.org/uai/2014/stratos2014uai-spectral/)BibTeX
@inproceedings{stratos2014uai-spectral,
title = {{A Spectral Algorithm for Learning Class-Based N-Gram Models of Natural Language}},
author = {Stratos, Karl and Kim, Do-kyum and Collins, Michael and Hsu, Daniel J.},
booktitle = {Conference on Uncertainty in Artificial Intelligence},
year = {2014},
pages = {762-771},
url = {https://mlanthology.org/uai/2014/stratos2014uai-spectral/}
}