Automatic Acquisition of Lexical Knowledge from Sparse and Noisy Data
Abstract
Optical character recognition (OCR) still causes a considerable amount of information reduction and noise in texts, so that many documents are unsuitable for information extraction systems. This paper introduces a statistical method for bootstrapping a lexicon from a very small number of “noisy,” domain-specific texts. The method detects regularities in grammatical forms as well as recurring ungrammatical forms in the input text. Through a combination of frequency lists and Levenshtein matrices, a language-independent, robust core lexicon is constructed that also supports the analysis of noisy texts.
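The paper itself does not spell out its algorithm in this abstract, but the general idea it names — combining token frequencies with Levenshtein distances so that rare, garbled forms are mapped onto frequent, well-attested ones — can be sketched as follows. This is a minimal illustration under assumed parameters (`min_freq`, `max_dist` are hypothetical thresholds, not from the paper):

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def bootstrap_lexicon(tokens, min_freq=2, max_dist=1):
    """Treat frequent tokens as core lexicon entries; map each rare token
    to the nearest core entry within max_dist (a likely OCR variant)."""
    freqs = Counter(tokens)
    core = {t for t, n in freqs.items() if n >= min_freq}
    variants = {}
    for tok in freqs:
        if tok in core or not core:
            continue
        best = min(core, key=lambda c: levenshtein(tok, c))
        if levenshtein(tok, best) <= max_dist:
            variants[tok] = best
    return core, variants
```

For example, in a noisy token stream such as `["contract", "contract", "c0ntract", "clause", "clause", "cIause"]`, the frequent forms become core entries and the OCR-confused variants (`0` for `o`, `I` for `l`) are mapped back onto them.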
Cite

Text:
Schneider. "Automatic Acquisition of Lexical Knowledge from Sparse and Noisy Data." European Conference on Machine Learning, 1998. doi:10.1007/BFB0026670

Markdown:
[Schneider. "Automatic Acquisition of Lexical Knowledge from Sparse and Noisy Data." European Conference on Machine Learning, 1998.](https://mlanthology.org/ecmlpkdd/1998/schneider1998ecml-automatic/) doi:10.1007/BFB0026670

BibTeX:
@inproceedings{schneider1998ecml-automatic,
title = {{Automatic Acquisition of Lexical Knowledge from Sparse and Noisy Data}},
author = {Schneider, René},
booktitle = {European Conference on Machine Learning},
year = {1998},
pages = {43-48},
doi = {10.1007/BFB0026670},
url = {https://mlanthology.org/ecmlpkdd/1998/schneider1998ecml-automatic/}
}