Automatic Acquisition of Lexical Knowledge from Sparse and Noisy Data

Abstract

Optical character recognition (OCR) still introduces a considerable amount of information reduction and noise into texts, so that many documents are unsuitable for information extraction systems. This paper introduces a statistical method for bootstrapping a lexicon from a very small number of “noisy,” domain-specific texts. The method detects regularities in grammatical forms as well as recurring ungrammatical forms in the input text. Through a combination of frequency lists and Levenshtein matrices, a language-independent, robust core lexicon is constructed that also supports the analysis of “noisy” texts.
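The core idea of combining frequency lists with Levenshtein distances can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the function names, thresholds (`min_freq`, `max_dist`), and the simple nearest-seed mapping are assumptions chosen to show how frequent forms can seed a lexicon while rarer forms within a small edit distance are treated as noisy variants.

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def bootstrap_lexicon(tokens, min_freq=2, max_dist=1):
    # Forms occurring at least min_freq times seed the core lexicon;
    # rarer forms within max_dist edits of a seed are mapped to it as
    # presumed OCR variants (hypothetical heuristic for illustration).
    counts = Counter(tokens)
    seeds = [w for w, c in counts.most_common() if c >= min_freq]
    lexicon = {w: w for w in seeds}
    for w in counts:
        if w in lexicon:
            continue
        best = min(seeds, key=lambda s: levenshtein(w, s), default=None)
        if best is not None and levenshtein(w, best) <= max_dist:
            lexicon[w] = best  # normalize the noisy form to its seed
    return lexicon
```

On a toy token stream such as `["house", "house", "house", "hause", "tree", "tree", "trec"]`, the frequent forms `house` and `tree` become seeds, and the one-off OCR-like variants `hause` and `trec` are mapped to them.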

Cite

Text

René Schneider. "Automatic Acquisition of Lexical Knowledge from Sparse and Noisy Data." European Conference on Machine Learning, 1998. doi:10.1007/BFB0026670

Markdown

[René Schneider. "Automatic Acquisition of Lexical Knowledge from Sparse and Noisy Data." European Conference on Machine Learning, 1998.](https://mlanthology.org/ecmlpkdd/1998/schneider1998ecml-automatic/) doi:10.1007/BFB0026670

BibTeX

@inproceedings{schneider1998ecml-automatic,
  title     = {{Automatic Acquisition of Lexical Knowledge from Sparse and Noisy Data}},
  author    = {Schneider, René},
  booktitle = {European Conference on Machine Learning},
  year      = {1998},
  pages     = {43--48},
  doi       = {10.1007/BFB0026670},
  url       = {https://mlanthology.org/ecmlpkdd/1998/schneider1998ecml-automatic/}
}