Sphere Embedding: An Application to Part-of-Speech Induction

Abstract

Motivated by an application to unsupervised part-of-speech tagging, we present an algorithm for the Euclidean embedding of large sets of categorical data based on co-occurrence statistics. We use the CODE model of Globerson et al. but constrain the embedding to lie on a high-dimensional unit sphere. This constraint allows for efficient optimization, even in the case of large datasets and high embedding dimensionality. Using k-means clustering of the embedded data, our approach efficiently produces state-of-the-art results. We analyze the reasons why the sphere constraint is beneficial in this application, and conjecture that these reasons might apply quite generally to other large-scale tasks.
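The two ingredients named in the abstract, constraining embedded points to the unit sphere and clustering them with k-means, can be illustrated with a minimal sketch. It does not implement the CODE objective or the paper's optimization procedure, neither of which is specified in the abstract. The random toy vectors, the helper names `project_to_sphere` and `kmeans`, and the choices of 200 points, 25 dimensions, and 5 clusters are all assumptions made for illustration.

```python
import numpy as np


def project_to_sphere(x):
    """Rescale each row of x to unit Euclidean norm (the sphere constraint)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    # Guard against division by zero for an all-zero row.
    return x / np.maximum(norms, 1e-12)


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means (hypothetical helper, not the paper's code).

    Returns the labels from the last assignment step and the centroids
    after the last update step.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points (fancy indexing copies).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        diffs = points[:, None, :] - centroids[None, :, :]
        labels = (diffs ** 2).sum(axis=2).argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Toy stand-in for learned embeddings: random vectors, NOT real word data.
toy_rng = np.random.default_rng(0)
raw = toy_rng.normal(size=(200, 25))
embedded = project_to_sphere(raw)
labels, centroids = kmeans(embedded, k=5)
```

In an actual part-of-speech induction setting, the rows of `embedded` would be the learned word embeddings and each cluster label would serve as an induced tag. Note that the centroids produced here are ordinary Euclidean means and generally lie inside the sphere rather than on it.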

Cite

Text

Maron et al. "Sphere Embedding: An Application to Part-of-Speech Induction." Neural Information Processing Systems, 2010.

Markdown

[Maron et al. "Sphere Embedding: An Application to Part-of-Speech Induction." Neural Information Processing Systems, 2010.](https://mlanthology.org/neurips/2010/maron2010neurips-sphere/)

BibTeX

@inproceedings{maron2010neurips-sphere,
  title     = {{Sphere Embedding: An Application to Part-of-Speech Induction}},
  author    = {Maron, Yariv and Lamar, Michael and Bienenstock, Elie},
  booktitle = {Neural Information Processing Systems},
  year      = {2010},
  pages     = {1567--1575},
  url       = {https://mlanthology.org/neurips/2010/maron2010neurips-sphere/}
}