Learning from Uncertain Data

Abstract

The application of statistical methods to natural language processing has been remarkably successful over the past two decades. But, to deal with recent problems arising in this field, machine learning techniques must be generalized to deal with uncertain data , or datasets whose elements are distributions over sequences, such as weighted automata. This paper reviews a number of recent results related to this question. We discuss how to compute efficiently basic statistics from a weighted automaton such as the expected count of an arbitrary sequence and higher moments of that distribution, by using weighted transducers. Both the corresponding transducers and related algorithms are described. We show how general classification techniques such as Support Vector Machines can be extended to deal with distributions by using general kernels between weighted automata. We describe several examples of positive definite kernels between weighted automata such as kernels based on counts of common n -gram sequences, counts of common factors or suffixes, or other more complex kernels, and describe a general algorithm for computing them efficiently. We also demonstrate how machine learning techniques such as clustering based on the edit-distance can be extended to deal with unweighted and weighted automata representing distributions.

Cite

Text

Mohri. "Learning from Uncertain Data." Annual Conference on Computational Learning Theory, 2003. doi:10.1007/978-3-540-45167-9_48

Markdown

[Mohri. "Learning from Uncertain Data." Annual Conference on Computational Learning Theory, 2003.](https://mlanthology.org/colt/2003/mohri2003colt-learning/) doi:10.1007/978-3-540-45167-9_48

BibTeX

@inproceedings{mohri2003colt-learning,
  title     = {{Learning from Uncertain Data}},
  author    = {Mohri, Mehryar},
  booktitle = {Annual Conference on Computational Learning Theory},
  year      = {2003},
  pages     = {656-670},
  doi       = {10.1007/978-3-540-45167-9_48},
  url       = {https://mlanthology.org/colt/2003/mohri2003colt-learning/}
}