Unsupervised Context Sensitive Language Acquisition from a Large Corpus
Abstract
We describe a pattern acquisition algorithm that learns, in an unsupervised fashion, a streamlined representation of linguistic structures from a plain natural-language corpus. This paper addresses the issues of learning structured knowledge from a large-scale natural language data set, and of generalization to unseen text. The implemented algorithm represents sentences as paths on a graph whose vertices are words (or parts of words). Significant patterns, determined by recursive context-sensitive statistical inference, form new vertices. Linguistic constructions are represented by trees composed of significant patterns and their associated equivalence classes. An input module allows the algorithm to be subjected to a standard test of English as a Second Language (ESL) proficiency. The results are encouraging: the model attains a level of performance considered to be “intermediate” for 9th-grade students, despite having been trained on a corpus (CHILDES) containing transcribed speech of parents directed to small children.
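The graph representation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it only shows how sentences might be stored as paths over word vertices, with a count of how many paths traverse each edge. The boundary markers `<begin>` and `<end>`, the whitespace tokenization, and the function names are assumptions made for illustration; the statistical test for significant patterns is not shown.

```python
from collections import Counter

# Hypothetical boundary markers; the abstract does not specify any.
BEGIN, END = "<begin>", "<end>"


def sentence_to_path(sentence):
    """Turn a sentence into a path: a list of word vertices with boundaries."""
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    return [BEGIN] + sentence.lower().split() + [END]


def build_graph(sentences):
    """Collect the vertices, per-edge traversal counts, and paths of a corpus."""
    vertices = set()
    edge_counts = Counter()
    paths = []
    for sentence in sentences:
        path = sentence_to_path(sentence)
        paths.append(path)
        vertices.update(path)
        # Each consecutive pair of vertices on the path is one directed edge.
        edge_counts.update(zip(path, path[1:]))
    return vertices, edge_counts, paths


corpus = ["the cat is here", "the dog is here", "the cat sleeps"]
vertices, edge_counts, paths = build_graph(corpus)
```

In this toy corpus, edges shared by several paths (such as the one from "the" to "cat") accumulate higher counts. Counts of this kind are one plausible raw input for a statistical criterion that selects candidate patterns, although the criterion itself is described only in the paper.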
Cite
Text
Solan et al. "Unsupervised Context Sensitive Language Acquisition from a Large Corpus." Neural Information Processing Systems, 2003.
Markdown
[Solan et al. "Unsupervised Context Sensitive Language Acquisition from a Large Corpus." Neural Information Processing Systems, 2003.](https://mlanthology.org/neurips/2003/solan2003neurips-unsupervised/)
BibTeX
@inproceedings{solan2003neurips-unsupervised,
title = {{Unsupervised Context Sensitive Language Acquisition from a Large Corpus}},
author = {Solan, Zach and Horn, David and Ruppin, Eytan and Edelman, Shimon},
booktitle = {Neural Information Processing Systems},
year = {2003},
pages = {961-968},
url = {https://mlanthology.org/neurips/2003/solan2003neurips-unsupervised/}
}