Stochastic Grammatical Inference of Text Database Structure

Abstract

For a document collection in which structural elements are identified with markup, it is often necessary to construct a grammar retrospectively that constrains element nesting and ordering. This has been addressed by others as an application of grammatical inference. We describe an approach based on stochastic grammatical inference which scales more naturally to large data sets and produces models with richer semantics. We adopt an algorithm that produces stochastic finite automata and describe modifications that enable better interactive control of results. Our experimental evaluation uses four document collections with varying structure.

Cite

Text

Young-Lai and Tompa. "Stochastic Grammatical Inference of Text Database Structure." Machine Learning, 2000. doi:10.1023/A:1007653929870

Markdown

[Young-Lai and Tompa. "Stochastic Grammatical Inference of Text Database Structure." Machine Learning, 2000.](https://mlanthology.org/mlj/2000/younglai2000mlj-stochastic/) doi:10.1023/A:1007653929870

BibTeX

@article{younglai2000mlj-stochastic,
  title     = {{Stochastic Grammatical Inference of Text Database Structure}},
  author    = {Young-Lai, Matthew and Tompa, Frank Wm.},
  journal   = {Machine Learning},
  year      = {2000},
  pages     = {111--137},
  doi       = {10.1023/A:1007653929870},
  volume    = {40},
  url       = {https://mlanthology.org/mlj/2000/younglai2000mlj-stochastic/}
}