Unsupervised Context Sensitive Language Acquisition from a Large Corpus

Abstract

We describe a pattern acquisition algorithm that learns, in an unsupervised fashion, a streamlined representation of linguistic structures from a plain natural-language corpus. This paper addresses the issues of learning structured knowledge from a large-scale natural language data set, and of generalization to unseen text. The implemented algorithm represents sentences as paths on a graph whose vertices are words (or parts of words). Significant patterns, determined by recursive context-sensitive statistical inference, form new vertices. Linguistic constructions are represented by trees composed of significant patterns and their associated equivalence classes. An input module allows the algorithm to be subjected to a standard test of English as a Second Language (ESL) proficiency. The results are encouraging: the model attains a level of performance considered to be “intermediate” for 9th-grade students, despite having been trained on a corpus (CHILDES) containing transcribed speech of parents directed to small children.
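The graph representation mentioned in the abstract can be illustrated with a minimal Python sketch. It covers only the first step, storing each sentence as a path over word vertices and counting how often each edge is traversed. It does not implement the paper's statistical inference of significant patterns or its equivalence classes. The names `BEGIN`, `END`, `sentence_to_path`, and `build_graph`, the whitespace tokenization, and the toy corpus are all assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical boundary markers so every sentence has a defined start and end vertex.
BEGIN, END = "<begin>", "<end>"


def sentence_to_path(sentence):
    """Turn a sentence into a path: a list of word vertices between BEGIN and END."""
    # Assumption: lowercase, whitespace-separated tokens stand in for "words".
    return [BEGIN] + sentence.lower().split() + [END]


def build_graph(sentences):
    """Return (edge_counts, paths) for a list of sentences.

    edge_counts maps a directed edge (a, b) to the number of times
    some sentence path moves from vertex a to vertex b.
    """
    edge_counts = defaultdict(int)
    paths = []
    for sentence in sentences:
        path = sentence_to_path(sentence)
        paths.append(path)
        # Each consecutive pair of vertices on the path is one edge traversal.
        for a, b in zip(path, path[1:]):
            edge_counts[(a, b)] += 1
    return edge_counts, paths


# Toy corpus, invented for this example.
corpus = ["the cat is here", "the dog is here", "where is the cat"]
edge_counts, paths = build_graph(corpus)
```

Edges shared by several paths, such as `("is", "here")` in the toy corpus, are the kind of recurring sub-path that a pattern-finding step could then examine. How such candidates are judged significant is specified in the paper itself and is not reproduced here.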

Cite

Text

Solan et al. "Unsupervised Context Sensitive Language Acquisition from a Large Corpus." Neural Information Processing Systems, 2003.

Markdown

[Solan et al. "Unsupervised Context Sensitive Language Acquisition from a Large Corpus." Neural Information Processing Systems, 2003.](https://mlanthology.org/neurips/2003/solan2003neurips-unsupervised/)

BibTeX

@inproceedings{solan2003neurips-unsupervised,
  title     = {{Unsupervised Context Sensitive Language Acquisition from a Large Corpus}},
  author    = {Solan, Zach and Horn, David and Ruppin, Eytan and Edelman, Shimon},
  booktitle = {Neural Information Processing Systems},
  year      = {2003},
  pages     = {961-968},
  url       = {https://mlanthology.org/neurips/2003/solan2003neurips-unsupervised/}
}