Part-of-Speech Tagging from "Small" Data Sets
Abstract
Many probabilistic approaches to part-of-speech (POS) tagging compile statistics from massive corpora such as the LOB. Using the hidden Markov model (HMM) method on a 900,000-token training corpus, it is not difficult to achieve a success rate of 95 per cent on a 100,000-token test corpus. However, even such large training corpora contain relatively few distinct words. For example, the LOB contains about 45,000 words, most of which occur only once or twice. As a result, 3-4 per cent of tokens in the test corpus are unseen and cause a significant proportion of errors. A corpus large enough to accurately represent all possible tag sequences seems implausible, let alone a corpus that also represents, even in small numbers, enough of English to make the problem of unseen words insignificant. This work argues that this may not be necessary, describing variations on HMM-based tagging that facilitate learning from relatively little data, including ending-based approaches, incremental learning strategies, and the use of approximate distributions.
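To make the ideas in the abstract concrete, here is a minimal illustrative sketch (not the authors' code) of a bigram HMM tagger whose emission model backs off to word endings for unseen words, in the spirit of the ending-based approaches mentioned above. The toy corpus, tag names, and add-0.1 smoothing are all assumptions made for the example.

```python
from collections import defaultdict

def train(tagged_sents, suffix_len=3):
    """Collect bigram transition, word-emission, and ending-emission counts."""
    trans = defaultdict(lambda: defaultdict(int))   # counts for P(tag_i | tag_{i-1})
    emit = defaultdict(lambda: defaultdict(int))    # counts for P(word | tag)
    suffix = defaultdict(lambda: defaultdict(int))  # counts for P(ending | tag)
    vocab = set()
    for sent in tagged_sents:
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            suffix[tag][word[-suffix_len:]] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, suffix, vocab

def viterbi(words, trans, emit, suffix, vocab, suffix_len=3):
    """Standard Viterbi decoding; unseen words fall back to their ending."""
    tags = list(emit)

    def e(tag, word):
        if word in vocab:
            c = emit[tag]
            return (c[word] + 0.1) / (sum(c.values()) + 0.1 * len(vocab))
        # Word never seen in training: estimate emission from its ending,
        # e.g. an unseen "-ing" word still looks like a verb.
        c = suffix[tag]
        return (c[word[-suffix_len:]] + 0.1) / (sum(c.values()) + 0.1 * len(vocab))

    def t(p, q):
        c = trans[p]
        return (c[q] + 0.1) / (sum(c.values()) + 0.1 * len(tags))

    # Dynamic program over (position, tag): best[tag] = (probability, path).
    best = {tag: (t("<s>", tag) * e(tag, words[0]), [tag]) for tag in tags}
    for word in words[1:]:
        new = {}
        for tag in tags:
            p, path = max(((best[pt][0] * t(pt, tag), best[pt][1]) for pt in tags),
                          key=lambda x: x[0])
            new[tag] = (p * e(tag, word), path + [tag])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]
```

With a two-sentence training set, the tagger can still assign VERB to an unseen word like "jumping" because its "-ing" ending was observed under VERB, which is exactly the kind of generalization that makes small training sets viable.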
Cite
Neufeld et al. "Part-of-Speech Tagging from "Small" Data Sets." Pre-proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, 1995. https://mlanthology.org/aistats/1995/neufeld1995aistats-partofspeech/

BibTeX
@inproceedings{neufeld1995aistats-partofspeech,
title = {{Part-of-Speech Tagging from "Small" Data Sets}},
author = {Neufeld, Eric and Adams, Greg and Choy, Henry and Orthner, Ron and Philip, Tim and Tawfik, Ahmed},
booktitle = {Pre-proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics},
year = {1995},
pages = {410-416},
volume = {R0},
url = {https://mlanthology.org/aistats/1995/neufeld1995aistats-partofspeech/}
}