A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases

Abstract

We consider a data mining problem in a large collection of unstructured texts based on association rules over subwords of texts. A two-words association pattern is an expression such as (TATA, 30, AGGAGGT) ⇒ C that expresses a rule that if a text contains a subword TATAfollowed by another subword AGGAGGT with distance no more than 30 letters then a property C will hold with high probability. The optimized confidence pattern problem is to compute frequent patterns (α, k , Β) that optimize the confidence with respect to a given collection of texts. Although this problem is solved in polynomial time by a straightforward algorithm that enumerates all the possible patterns in time O ( n ^5), we focus on the development of more efficient algorithms that can be applied to large text databases. We present an algorithm that solves the optimized confidence pattern problem in time O (max k ; m n ^2) and space O(kn) , where m and n are the number and the total length of classification examples, respectively, and k is a small constant around 30 ∼ 50. This algorithm combines the sufix tree data structure in combinatorial string matching and the orthogonal range query technique in computational geometry for fast computation. Furthermore for most random texts like DNA sequences, we show that a modification of the algorithm runs very efficiently in time O ( kn log^3 n ) and space O(kn) . We also discuss some heuristics such as sampling and pruning as practical improvement. Then, we evaluate the efficiency and the performance of the algorithm with experiments on genetic sequences. A relationship with efficient Agnostic PAC-learning is also discussed.

Cite

Text

Arimura et al. "A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases." International Conference on Algorithmic Learning Theory, 1998. doi:10.1007/3-540-49730-7_19

Markdown

[Arimura et al. "A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases." International Conference on Algorithmic Learning Theory, 1998.](https://mlanthology.org/alt/1998/arimura1998alt-fast/) doi:10.1007/3-540-49730-7_19

BibTeX

@inproceedings{arimura1998alt-fast,
  title     = {{A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases}},
  author    = {Arimura, Hiroki and Wataki, Atsushi and Fujino, Ryoichi and Arikawa, Setsuo},
  booktitle = {International Conference on Algorithmic Learning Theory},
  year      = {1998},
  pages     = {247-261},
  doi       = {10.1007/3-540-49730-7_19},
  url       = {https://mlanthology.org/alt/1998/arimura1998alt-fast/}
}